EDITORS'  PREFACE

The  increasing  specialisation  in  biological  inquiry
has  made  it  impossible  for  any  one  author  to  deal 
adequately  with  current  advances  in  knowledge.  It 
has  become  a  matter  of  considerable  difficulty  for  a 
research  student  to  gain  a  correct  idea  of  the  present 
state  of  knowledge  of  a  subject  in  which  he  himself  is 
interested.  To  meet  this  situation  the  text-book  is 
being  supplemented  by  the  monograph. 

The  aim  of  the  present  series  is  to  provide  authori- 
tative accounts  of  what  has  been  done  in  some  of  the 
diverse  branches  of  biological  investigation,  and  at 
the  same  time  to  give  to  those  who  have  contributed 
notably  to  the  development  of  a  particular  field  of 
inquiry  an  opportunity  of  presenting  the  results  of 
their  researches,  scattered  throughout  the  scientific 
journals,  in  a  more  extended  form,  showing  their
relation  to  what  has  already  been  done  and  to 
problems  that  remain  to  be  solved. 

The  present  generation  is  witnessing  "  a  return  to 
the  practice  of  older  days  when  animal  physiology 
was  not  yet  divorced  from  morphology."  Con- 
spicuous progress  is  now  being  seen  in  the  field  of 
general  physiology,  of  experimental  biology,  and  in 
the  application  of  biological  principles  to  economic 
problems.  Often  the  analysis  of  large  masses  of 
data  by  statistical  methods  is  necessary,  and  the
biological  worker  is  continually  encountering  advanced
statistical  problems  the  adequate  solutions  of  which
are  not  found  in  current  statistical  text-books.  To 
meet  these  needs  the  present  monograph  was  pre- 
pared, and  the  early  call  for  a  second  edition 
indicates  the  success  attained  by  the  author  in 
his  project. 

F.  A.  E.  C. 
D.  W.  C. 


PREFACE  TO  FIRST  EDITION 

For  several  years  the  author  has  been  working  in 
somewhat  intimate  co-operation  with  a  number  of 
biological  research  departments  ;  the  present  book  is 
in  every  sense  the  product  of  this  circumstance.  Daily 
contact  with  the  statistical  problems  which  present 
themselves  to  the  laboratory  worker  has  stimulated 
the  purely  mathematical  researches  upon  which  are 
based  the  methods  here  presented.  Little  experience 
is  sufficient  to  show  that  the  traditional  machinery  of 
statistical  processes  is  wholly  unsuited  to  the  needs  of 
practical  research.  Not  only  does  it  take  a  cannon 
to  shoot  a  sparrow,  but  it  misses  the  sparrow  !  The 
elaborate  mechanism  built  on  the  theory  of  infinitely 
large  samples  is  not  accurate  enough  for  simple  labora- 
tory data.  Only  by  systematically  tackling  small 
sample  problems  on  their  merits  does  it  seem  possible 
to  apply  accurate  tests  to  practical  data.  Such  at 
least  has  been  the  aim  of  this  book. 

I  owe  more  than  I  can  say  to  Mr.  W.  S.  Gosset, 
Mr.  E.  Somerfield,  and  Miss  W.  A.  Mackenzie,  who 
have  read  the  proofs  and  made  many  valuable  sugges- 
tions. Many  small  but  none  the  less  troublesome 
errors  have  been  removed  ;  I  shall  be  grateful  to 
readers  who  will  notify  me  of  any  further  errors  and 
ambiguities  they  may  detect. 

ROTHAMSTED    EXPERIMENTAL    STATION. 
February  1925. 


PREFACE  TO  SECOND  EDITION 

The  early  demand  for  a  new  edition  has  more  than 
justified  the  author's  hope  that  use  could  be  made 
of  a  book  which,  without  entering  into  the  mathe- 
matical theory  of  statistical  methods,  should  embody 
the  latest  results  of  that  theory,  presenting  them  in 
the  form  of  practical  procedures  appropriate  to  those 
types  of  data  with  which  research  workers  are  actually 
concerned. 

Those  critics  who  would  like  to  have  seen  the 
inclusion  of  mathematical  proofs  of  the  more  important 
propositions  of  the  underlying  theory,  must  still  be 
referred  to  the  technical  publications  given  in  the  list 
of  sources.  There  they  will  encounter  exactly  those 
difficulties  which  it  would  be  undesirable  to  import 
into  the  present  work  ;  and  will  perceive  that  modern 
statistics  could  not  have  been  developed  without  the 
elaboration  of  a  system  of  ideas,  logical  and  mathe- 
matical, which,  however  fascinating  in  themselves, 
cannot  be  regarded  as  a  necessary  part  of  the  equip- 
ment of  every  research  worker. 

To  present  "  elementary  proofs,"  of  the  kind 
which  do  not  involve  these  ideas,  would  be  really  to 
justify  the  censure  of  a  second  school  of  critics,  who, 
rightly  feeling  that  a  fallacious  proof  is  worse  than 
none,  are  eager  to  decry  any  attempt  to  "  teach 
people  to  run  before  they  can  walk."  The  actual 
scope  of  the  present  volume  really  exempts  it  from
this  criticism,  which,  besides,  in  an  age  of  technical
co-operation,  has  seldom  much  force.  The  practical 
application  of  general  theorems  is  a  different  art  from 
their  establishment  by  mathematical  proof,  and  one 
useful  to  many  to  whom  the  other  is  unnecessary. 

The  importance  of  providing  a  striking  and  de- 
tailed illustration  of  the  principles  of  statistical  estima- 
tion has  led  to  the  addition  of  a  ninth  chapter.  This 
subject  received  only  general  discussion  in  the  first 
edition,  and,  in  spite  of  its  practical  importance,  has 
nowhere  been  made  available  to  the  general  student. 
The  new  chapter  supersedes  Section  6  and  Example  1
of  the  first  edition.  The  numbers  of  other  sections, 
tables,  and  examples  have  not  been  altered,  so  that 
references  to  their  numbers,  though  not  to  those  of 
pages,  will  be  valid  irrespective  of  the  edition  used. 

A  new  section,  28·1,  has  been  added  to  Chapter  V.,
giving  a  much  abbreviated  method  of  constructing  the 
polynomial  values  fitted  to  a  series  ;  Table  VI.,  the 
table  of  z,  has  also  been  much  enlarged,  and  should
now  be  fully  sufficient  for  the  great  range  of  practical 
problems  for  which  it  is  required. 

With  respect  to  the  folding  copies  of  tables  bound 
with  the  book,  it  may  be  mentioned  that  many 
laboratory  workers,  who  have  occasion  to  use  them 
constantly,  have  found  it  convenient  to  mount  these 
on  the  faces  of  triangular  or  square  prisms,  which 
may  be  kept  near  at  hand  for  immediate  reference. 
The  practical  advantages  of  this  plan  have  made  it 
seem  worth  while  to  bring  it  to  the  notice  of  all  readers. 

ROTHAMSTED.  February  1928. 


CONTENTS 


Editors' Preface
Preface to First Edition
Preface to Second Edition
I. Introductory
II. Diagrams
III. Distributions
IV. Tests of Goodness of Fit, Independence and Homogeneity; with Table of χ²
V. Tests of Significance of Means, Differences of Means, and Regression Coefficients
VI. The Correlation Coefficient
VII. Intraclass Correlations and the Analysis of Variance
VIII. Further Applications of the Analysis of Variance
IX. The Principles of Statistical Estimation
Sources used for Data and Methods
Index


TABLES

I. and II. Normal Distribution
III. Table of χ²
IV. Table of t
V.a. Correlation Coefficient: Significant Values
V.b. Correlation Coefficient: Transformed Values
VI. Table of z


Statistical    Methods 
for  Research  Workers 


I 

INTRODUCTORY 
1.  The  Scope  of  Statistics 

The  science  of  statistics  is  essentially  a  branch  of 
Applied  Mathematics,  and  may  be  regarded  as 
mathematics  applied  to  observational  data.  As  in 
other  mathematical  studies  the  same  formula  is  equally 
relevant  to  widely  different  groups  of  subject  matter. 
Consequently  the  unity  of  the  different  applications 
has  usually  been  overlooked,  the  more  naturally 
because  the  development  of  the  underlying  mathe- 
matical theory  has  been  much  neglected.  We  shall 
therefore  consider  the  subject  matter  of  statistics 
under  three  different  aspects,  and  then  show  in  more 
mathematical  language  that  the  same  types  of  prob- 
lems arise  in  every  case.  Statistics  may  be  regarded 
as  (i.)  the  study  of  populations,  (ii.)  the  study
of  variation,  (iii.)  the  study  of  methods  of  the
reduction  of  data. 

The  original  meaning  of  the  word  "  statistics  "
suggests  that  it  was  the  study  of  populations  of  human
beings  living  in  political  union.  The  methods 
developed,  however,  have  nothing  to  do  with  the 
political  unity  of  the  group,  and  are  not  confined 
to  populations  of  men  or  of  social  insects.  Indeed, 
since  no  observational  record  can  completely  specify 
a  human  being,  the  populations  studied  are  always 
to  some  extent  abstractions.  If  we  have  records  of 
the  stature  of  10,000  recruits,  it  is  rather  the  popula- 
tion of  statures  than  the  population  of  recruits  that  is 
open  to  study.  Nevertheless,  in  a  real  sense,  statistics 
is  the  study  of  populations,  or  aggregates  of  indi- 
viduals, rather  than  of  individuals.  Scientific  theories 
which  involve  the  properties  of  large  aggregates  of 
individuals,  and  not  necessarily  the  properties  of  the 
individuals  themselves,  such  as  the  Kinetic  Theory 
of  Gases,  the  Theory  of  Natural  Selection,  or  the 
chemical  Theory  of  Mass  Action,  are  essentially 
statistical  arguments,  and  are  liable  to  misinterpreta- 
tion as  soon  as  the  statistical  nature  of  the  argument 
is  lost  sight  of.  Statistical  methods  are  essential  to 
social  studies,  and  it  is  principally  by  the  aid  of  such 
methods  that  these  studies  may  be  raised  to  the  rank 
of  sciences.  This  particular  dependence  of  social 
studies  upon  statistical  methods  has  led  to  the  painful 
misapprehension  that  statistics  is  to  be  regarded  as 
a  branch  of  economics,  whereas  in  truth  methods 
adequate  to  the  treatment  of  economic  data,  in  so  far 
as  these  exist,  have  only  been  developed  in  the  study 
of  biology  and  the  other  sciences. 

The  idea  of  a  population  is  to  be  applied  not  only
to  living,  or  even  material,  individuals.  If  an  observation,
such  as  a  simple  measurement,  be  repeated
indefinitely,  the  aggregate  of  the  results  is  a  popu- 
lation of  measurements.  Such  populations  are  the 
particular  field  of  study  of  the  Theory  of  Errors,  one 
of  the  oldest  and  most  fruitful  lines  of  statistical  in- 
vestigation. Just  as  a  single  observation  may  be 
regarded  as  an  individual,  and  its  repetition  as  generat- 
ing a  population,  so  the  entire  result  of  an  extensive 
experiment  may  be  regarded  as  but  one  of  a  popu- 
lation of  such  experiments.  The  salutary  habit  of 
repeating  important  experiments,  or  of  carrying  out 
original  observations  in  replicate,  shows  a  tacit 
appreciation  of  the  fact  that  the  object  of  our  study  is 
not  the  individual  result,  but  the  population  of  possi- 
bilities of  which  we  do  our  best  to  make  our  experiments 
representative.  The  calculation  of  means  and  probable 
errors  shows  a  deliberate  attempt  to  learn  something 
about  that  population. 

The  conception  of  statistics  as  the  study  of  varia- 
tion is  the  natural  outcome  of  viewing  the  subject  as 
the  study  of  populations  ;  for  a  population  of  indi- 
viduals in  all  respects  identical  is  completely  described 
by  a  description  of  any  one  individual,  together  with 
the  number  in  the  group.  The  populations  which 
are  the  object  of  statistical  study  always  display 
variation  in  one  or  more  respects.  To  speak  of 
statistics  as  the  study  of  variation  also  serves  to 
emphasise  the  contrast  between  the  aims  of  modern 
statisticians  and  those  of  their  predecessors.  For, 
until  comparatively  recent  times,  the  vast  majority
of  workers  in  this  field  appear  to  have  had  no  other
aim  than  to  ascertain  aggregate,  or  average,  values. 
The  variation  itself  was  not  an  object  of  study,  but 
was  recognised  rather  as  a  troublesome  circumstance 
which  detracted  from  the  value  of  the  average.  The 
error  curve  of  the  mean  of  a  normal  sample  has  been 
familiar  for  a  century,  but  that  of  the  standard  deviation
was  the  object  of  researches  up  to  1915.  Yet,
from  the  modern  point  of  view,  the  study  of  the  causes 
of  variation  of  any  variable  phenomenon,  from  the 
yield  of  wheat  to  the  intellect  of  man,  should  be  begun 
by  the  examination  and  measurement  of  the  variation 
which  presents  itself. 

The  study  of  variation  leads  immediately  to  the 
concept  of  a  frequency  distribution.  Frequency  dis- 
tributions are  of  various  kinds,  according  as  the 
number  of  classes  in  which  the  population  is  distri- 
buted is  finite  or  infinite,  and  also  according  as  the 
intervals  which  separate  the  classes  are  finite  or 
infinitesimal.  In  the  simplest  possible  case,  in  which 
there  are  only  two  classes,  such  as  male  and  female 
births,  the  distribution  is  simply  specified  by  the  pro- 
portion in  which  these  occur,  as  for  example  by  the 
statement  that  51  per  cent  of  the  births  are  of  males 
and  49  per  cent  of  females.  In  other  cases  the  varia- 
tion may  be  discontinuous,  but  the  number  of  classes 
indefinite,  as  with  the  number  of  children  born  to 
different  married  couples  ;  the  frequency  distribution 
would  then  show  the  frequency  with  which  0,  1,  2  .  .  .
children  were  recorded,  the  number  of  classes  being
sufficient  to  include  the  largest  family  in  the  record.
The  variable  quantity,  such  as  the  number  of  children,
is  called  the  variate,  and  the  frequency  distribution 
specifies  how  frequently  the  variate  takes  each  of  its 
possible  values.  In  the  third  group  of  cases,  the 
variate,  such  as  human  stature,  may  take  any  inter- 
mediate value  within  its  range  of  variation  ;  the 
variate  is  then  said  to  vary  continuously,  and  the 
frequency  distribution  may  be  expressed  by  stating,  as 
a  mathematical  function  of  the  variate,  either  (i.)  the 
proportion  of  the  population  for  which  the  variate  is 
less  than  any  given  value,  or  (ii.)  by  the  mathematical 
device  of  differentiating  this  function,  the  (infinitesimal) 
proportion  of  the  population  for  which  the  variate 
falls  within  any  infinitesimal  element  of  its  range. 
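
The three kinds of frequency distribution described above may be illustrated by a short computation. What follows is only a sketch under stated assumptions: the record of family sizes is invented, the stature is assumed to be normally distributed with an arbitrary mean and standard deviation, and the helper name `frequency_distribution` is hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def frequency_distribution(values):
    """Proportion of the record falling in each class of a discontinuous variate."""
    counts = Counter(values)
    total = len(values)
    return {value: counts[value] / total for value in sorted(counts)}


# Hypothetical record: number of children in ten families.
children = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]
family_distribution = frequency_distribution(children)

# Continuous variate: stature, ASSUMED normal with mean 68 and s.d. 2.5 inches.
stature = NormalDist(mu=68.0, sigma=2.5)

# (i.) proportion of the population with stature below 70 inches.
proportion_below = stature.cdf(70.0)

# (ii.) the frequency per unit of range at 70 inches, obtained two ways:
# directly, and by numerically differentiating the function in (i.).
step = 1e-4
density_direct = stature.pdf(70.0)
density_by_differencing = (
    stature.cdf(70.0 + step) - stature.cdf(70.0 - step)
) / (2 * step)
```

The two values of the density agree closely, which is all that is meant by saying that form (ii.) is obtained from form (i.) by differentiation.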

The  idea  of  a  frequency  distribution  is  applicable 
either  to  populations  which  are  finite  in  number,  or  to 
infinite  populations,  but  it  is  more  usefully  and  more 
simply  applied  to  the  latter.  A  finite  population  can 
only  be  divided  in  certain  limited  ratios,  and  cannot  in 
any  case  exhibit  continuous  variation.  Moreover,  in 
most  cases  only  an  infinite  population  can  exhibit 
accurately,  and  in  their  true  proportion,  the  whole  of 
the  possibilities  arising  from  the  causes  actually  at 
work,  and  which  we  wish  to  study.  The  actual 
observations  can  only  be  a  sample  of  such  possibilities. 
With  an  infinite  population  the  frequency  distribution 
specifies  the  fractions  of  the  population  assigned  to 
the  several  classes  ;  we  may  have  (i.)  a  finite  number 
of  fractions  adding  up  to  unity  as  in  the  Mendelian 
frequency  distributions,  or  (ii.)  an  infinite  series  of 
finite  fractions  adding  up  to  unity,  or  (iii.)  a  mathematical
function  expressing  the  fraction  of  the  total
in  each  of  the  infinitesimal  elements  in  which  the  range 
of  the  variate  may  be  divided.  The  last  possibility 
may  be  represented  by  a  frequency  curve  ;  the  values 
of  the  variate  are  set  out  along  a  horizontal  axis,  the 
fraction  of  the  total  population,  within  any  limits  of 
the  variate,  being  represented  by  the  area  of  the  curve 
standing  on  the  corresponding  length  of  the  axis.  It 
should  be  noted  that  the  familiar  concept  of  the 
frequency  curve  is  only  applicable  to  an  infinite 
population  with  continuous  variate. 
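
The statement that a fraction of the population is represented by an area of the frequency curve can be checked numerically. The sketch below assumes a normal curve with zero mean and unit standard deviation, chosen purely for illustration; `area_under_curve` is a hypothetical helper using the trapezoidal rule.

```python
from statistics import NormalDist


def area_under_curve(curve, lower, upper, steps=10_000):
    """Trapezoidal approximation to the area under `curve` between two limits."""
    width = (upper - lower) / steps
    total = 0.5 * (curve(lower) + curve(upper))
    for i in range(1, steps):
        total += curve(lower + i * width)
    return total * width


# ASSUMED frequency curve: the normal curve with mean 0 and s.d. 1.
normal = NormalDist(mu=0.0, sigma=1.0)

# Area of the curve standing on the length of axis from -1 to +1 ...
area = area_under_curve(normal.pdf, -1.0, 1.0)

# ... compared with the fraction of the population within those limits.
fraction = normal.cdf(1.0) - normal.cdf(-1.0)
```

Both quantities come to roughly 68 per cent of the total, the familiar proportion lying within one standard deviation of the mean.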

The  study  of  variation  has  led  not  merely  to 
measurement  of  the  amount  of  variation  present,  but 
to  the  study  of  the  qualitative  problems  of  the  type,  or 
form,  of  the  variation.  Especially  important  is  the 
study  of  the  simultaneous  variation  of  two  or  more 
variates.  This  study,  arising  principally  out  of  the 
works  of  Galton  and  Pearson,  is  generally  known  in 
English  under  the  name  of  Correlation,  but  by  some 
continental  writers  as  Covariation. 

The  third  aspect  under  which  we  shall  regard  the 
scope  of  statistics  is  introduced  by  the  practical  need 
to  reduce  the  bulk  of  any  given  body  of  data.  Any 
investigator  who  has  carried  out  methodical  and 
extensive  observations  will  probably  be  familiar  with 
the  oppressive  necessity  of  reducing  his  results  to  a 
more  convenient  bulk.  No  human  mind  is  capable  of 
grasping  in  its  entirety  the  meaning  of  any  consider- 
able quantity  of  numerical  data.  We  want  to  be  able 
to  express  all  the  relevant  information  contained  in 
the  mass  by  means  of  comparatively  few  numerical
values.  This  is  a  purely  practical  need  which  the
science  of  statistics  is  able  to  some  extent  to  meet.  In 
some  cases  at  any  rate  it  is  possible  to  give  the  whole 
of  the  relevant  information  by  means  of  one  or  a  few 
values.  In  all  cases,  perhaps,  it  is  possible  to  reduce 
to  a  simple  numerical  form  the  main  issues  which  the 
investigator  has  in  view,  in  so  far  as  the  data  are  com- 
petent to  throw  light  on  such  issues.  The  number  of 
independent  facts  supplied  by  the  data  is  usually  far 
greater  than  the  number  of  facts  sought,  and  in  conse- 
quence much  of  the  information  supplied  by  any  body 
of  actual  data  is  irrelevant.  It  is  the  object  of  the 
statistical  processes  employed  in  the  reduction  of  data 
to  exclude  this  irrelevant  information,  and  to  isolate 
the  whole  of  the  relevant  information  contained  in  the 
data. 

2.  General  Method,  Calculation  of  Statistics 

The  discrimination  between  the  irrelevant  and  the 
relevant  information  is  performed  as  follows.  Even 
in  the  simplest  cases  the  values  (or  sets  of  values) 
before  us  are  interpreted  as  a  random  sample  of  a 
hypothetical  infinite  population  of  such  values  as 
might  have  arisen  in  the  same  circumstances.  The 
distribution  of  this  population  will  be  capable  of  some 
kind  of  mathematical  specification,  involving  a  certain 
number,  usually  few,  of  parameters,  or  "  constants  " 
entering  into  the  mathematical  formula.  These  para- 
meters are  the  characters  of  the  population.  If  we 
could  know  the  exact  values  of  the  parameters,  we 
should  know  all  (and  more  than)  any  sample  from
the  population  could  tell  us.  We  cannot  in  fact  know
the  parameters  exactly,  but  we  can  make  estimates 
of  their  values,  which  will  be  more  or  less  inexact. 
These  estimates,  which  are  termed  statistics,  are  of 
course  calculated  from  the  observations.  If  we  can 
find  a  mathematical  form  for  the  population  which 
adequately  represents  the  data,  and  then  calculate 
from  the  data  the  best  possible  estimates  of  the  re- 
quired parameters,  then  it  would  seem  that  there  is 
little,  or  nothing,  more  that  the  data  can  tell  us  ;  we 
shall  have  extracted  from  it  all  the  available  relevant 
information. 
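
The distinction between parameters and statistics may be made concrete by drawing an artificial sample from a population whose parameters are known. This is a sketch only: the normal form, the parameter values, the sample size, and the seed are all assumptions made for the illustration.

```python
import random
from statistics import mean, stdev

random.seed(1925)  # fixed seed so that the sketch is repeatable

# Parameters of the hypothetical infinite population (ASSUMED values).
MU, SIGMA = 10.0, 2.0

# A random sample of 50 values such as might have arisen from that population.
sample = [random.gauss(MU, SIGMA) for _ in range(50)]

# Statistics: estimates of the parameters, calculated from the observations.
estimate_of_mu = mean(sample)
estimate_of_sigma = stdev(sample)  # divisor n - 1
```

The two statistics will lie near, but not exactly at, the parameters 10 and 2; in practice only the statistics are available to the investigator.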

The  value  of  such  estimates  as  we  can  make  is 
enormously  increased  if  we  can  calculate  the  magnitude 
and  nature  of  the  errors  to  which  they  are  subject.  If 
we  can  rely  upon  the  specification  adopted,  this  pre- 
sents the  purely  mathematical  problem  of  deducing 
from  the  nature  of  the  population  what  will  be  the 
behaviour  of  each  of  the  possible  statistics  which  can 
be  calculated.  This  type  of  problem,  with  which  until 
recent  years  comparatively  little  progress  had  been 
made,  is  the  basis  of  the  tests  of  significance  by  which 
we  can  examine  whether  or  not  the  data  are  in  harmony 
with  any  suggested  hypothesis.  In  particular,  it  is 
necessary  to  test  the  adequacy  of  the  hypothetical 
specification  of  the  population  upon  which  the  method 
of  reduction  was  based. 

The  problems  which  arise  in  the  reduction  of  data 
may  thus  conveniently  be  divided  into  three  types  : 

(i.)  Problems  of  Specification,  which  arise  in  the 
choice  of  the  mathematical  form  of  the  population. 


(ii.)  Problems  of  Estimation,  which  involve  the 
choice  of  method  of  calculating,  from  our  sample, 
statistics  fit  to  estimate  the  unknown  parameters  of 
the  population. 

(iii.)  Problems  of  Distribution,  which  include  the 
mathematical  deduction  of  the  exact  nature  of  the 
distribution  in  random  samples  of  our  estimates  of  the 
parameters,  and  of  other  statistics  designed  to  test 
the  validity  of  our  specification  (tests  of  Goodness  of 
Fit). 

The  statistical  examination  of  a  body  of  data  is 
thus  logically  similar  to  the  general  alternation  of 
inductive  and  deductive  methods  throughout  the 
sciences.  A  hypothesis  is  conceived  and  defined  with 
all  necessary  exactitude  ;  its  logical  consequences  are 
ascertained  by  a  deductive  argument  ;  these  conse- 
quences are  compared  with  the  available  observations  ; 
if  these  are  completely  in  accord  with  the  deductions, 
the  hypothesis  is  justified  at  least  until  fresh  and  more 
stringent  observations  are  available. 
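
The cycle of hypothesis, deduction, and comparison with observation can be sketched for the simplest case of two classes. The data here are invented (60 male births in 100) and the hypothesis of equal proportions is an assumption of the example; the function names are hypothetical.

```python
from math import comb


def binomial_probability(k, n, p):
    """Probability of exactly k occurrences in n trials under the hypothesis p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_tail(observed, n, p):
    """Probability, under the hypothesis, of a count at least as far from
    its expectation n*p as the count actually observed."""
    expected = n * p
    gap = abs(observed - expected)
    return sum(
        binomial_probability(k, n, p)
        for k in range(n + 1)
        if abs(k - expected) >= gap - 1e-12
    )


# Hypothesis: males and females are equally frequent (p = 0.5).
# Hypothetical observation: 60 males among 100 births.
tail = two_sided_tail(60, 100, 0.5)
```

A small value of `tail` indicates that the observations are not in good accord with the deductions from the hypothesis; a large value leaves the hypothesis standing until more stringent observations are available.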

The  deduction  of  inferences  respecting  samples, 
from  assumptions  respecting  the  populations  from 
which  they  are  drawn,  shows  us  the  position  in 
Statistics  of  the  Theory  of  Probability.  For  a  given 
population  we  may  calculate  the  probability  with 
which  any  given  sample  will  occur,  and  if  we  can 
solve  the  purely  mathematical  problem  presented,  we 
can  calculate  the  probability  of  occurrence  of  any 
given  statistic  calculated  from  such  a  sample.  The 
problems  of  distribution  may  in  fact  be  regarded  as 
applications  and  extensions  of  the  theory  of  probability.
Three  of  the  distributions  with  which  we
shall  be  concerned,  Bernoulli's  binomial  distribution, 
Laplace's  normal  distribution,  and  Poisson's  series, 
were  developed  by  writers  on  probability.  For  many 
years,  extending  over  a  century  and  a  half,  attempts 
were  made  to  extend  the  domain  of  the  idea  of  prob- 
ability to  the  deduction  of  inferences  respecting 
populations  from  assumptions  (or  observations)  re- 
specting samples.  Such  inferences  are  usually  dis- 
tinguished under  the  heading  of  Inverse  Probability, 
and  have  at  times  gained  wide  acceptance.  This  is 
not  the  place  to  enter  into  the  subtleties  of  a  prolonged 
controversy  ;  it  will  be  sufficient  in  this  general 
outline  of  the  scope  of  Statistical  Science  to  express 
my  personal  conviction,  which  I  have  sustained  else- 
where, that  the  theory  of  inverse  probability  is  founded 
upon  an  error,  and  must  be  wholly  rejected.  In- 
ferences respecting  populations,  from  which  known 
samples  have  been  drawn,  cannot  be  expressed  in 
terms  of  probability,  except  in  the  trivial  case  when  the 
population  is  itself  a  sample  of  a  super-population  the 
specification  of  which  is  known  with  accuracy. 

This  is  not  to  say  that  we  cannot  draw,  from  know- 
ledge of  a  sample,  inferences  respecting  the  corre- 
sponding population.  Such  a  view  would  entirely  deny 
validity  to  all  experimental  science.  What  is  essen- 
tial is  that  the  mathematical  concept  of  probability  is 
inadequate  to  express  our  mental  confidence  or  diffi- 
dence in  making  such  inferences,  and  that  the  mathe- 
matical quantity  which  appears  to  be  appropriate  for 
measuring  our  order  of  preference  among  different
possible  populations  does  not  in  fact  obey  the  laws
of  probability.  To  distinguish  it  from  probability,  I 
have  used  the  term  "  Likelihood "  to  designate  this 
quantity  ;  since  both  the  words  "  likelihood  "  and 
"  probability  "  are  loosely  used  in  common  speech  to 
cover  both  kinds  of  relationship. 
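
One way of seeing that likelihood does not obey the laws of probability is sketched below for a binomial specification. The sample (7 successes in 10 trials) is hypothetical, and the comparison by integration is offered only as an illustration, not as the argument on which the distinction rests.

```python
from math import comb


def likelihood(p, successes, trials):
    """Likelihood of the proportion p, given the observed sample
    (binomial specification ASSUMED)."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


successes, trials = 7, 10  # hypothetical sample

# For a fixed population (p = 0.3), the probabilities of all possible
# samples add up to unity.
total_over_samples = sum(likelihood(0.3, k, trials) for k in range(trials + 1))

# For a fixed sample, the likelihoods of all possible populations do not:
# a midpoint-rule integral over p from 0 to 1 gives 1 / (trials + 1).
grid = 10_000
total_over_populations = (
    sum(likelihood((i + 0.5) / grid, successes, trials) for i in range(grid)) / grid
)

# Likelihood still supplies an order of preference among populations.
preference = likelihood(0.7, successes, trials) / likelihood(0.5, successes, trials)
```

The ratio `preference` exceeds unity, so the sample favours the population with p = 0.7 over that with p = 0.5, although neither likelihood is the probability that p has that value.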

3.  The  Qualifications  of  Satisfactory  Statistics 

The  solutions  of  problems  of  distribution  (which 
may  be  regarded  as  purely  deductive  problems  in  the 
theory  of  probability)  not  only  enable  us  to  make 
critical  tests  of  the  significance  of  statistical  results,  and 
of  the  adequacy  of  the  hypothetical  distribution  upon 
which  our  methods  of  numerical  deduction  are  based, 
but  afford  some  guidance  in  the  choice  of  appropriate 
statistics  for  purposes  of  estimation.  Such  statistics 
may  be  divided  into  classes  according  to  the  behaviour 
of  their  distributions  in  large  samples. 

If  we  calculate  a  statistic,  such,  for  example,  as  the 
mean,  from  a  very  large  sample,  we  are  accustomed  to 
ascribe  to  it  great  accuracy  ;  and  indeed  it  will  usually, 
but  not  always,  be  true,  that  if  a  number  of  such 
statistics  can  be  obtained  and  compared,  the  discrep- 
ancies between  them  will  grow  less  and  less,  as  the 
samples  from  which  they  are  drawn  are  made  larger 
and  larger.  In  fact,  as  the  samples  are  made  larger 
without  limit,  the  statistic  will  usually  tend  to  some 
fixed  value  characteristic  of  the  population,  and,  there- 
fore, expressible  in  terms  of  the  parameters  of  the 
population.  If,  therefore,  such  a  statistic  is  to  be  used 
to  estimate  these  parameters,  there  is  only  one  parametric
function  to  which  it  can  properly  be  equated.
If  it  be  equated  to  some  other  parametric  function,  we 
shall  be  using  a  statistic  which  even  from  an  infinite 
sample  does  not  give  the  correct  value  ;  it  tends 
indeed  to  a  fixed  value,  but  to  a  value  which  is  errone- 
ous from  the  point  of  view  with  which  it  was  used. 
Such  statistics  are  termed  Inconsistent  Statistics  ; 
except  when  the  error  is  extremely  minute,  as  in  the 
use  of  Sheppard's  corrections,  inconsistent  statistics 
should  be  regarded  as  outside  the  pale  of  decent  usage. 
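
An inconsistent statistic may be exhibited by simulation. The sketch assumes a normal population with unit standard deviation and takes as its example the mean deviation: equated directly to the standard deviation it settles on the wrong value, while equated to the proper parametric function (the standard deviation multiplied by the square root of 2/π) it becomes consistent. The choice of example, the sample size, and the seed are assumptions of the illustration.

```python
import random
from math import pi, sqrt

random.seed(3)  # fixed seed so that the sketch is repeatable

SIGMA = 1.0  # ASSUMED standard deviation of a normal population


def mean_deviation(sample):
    """Average absolute deviation of the observations from their mean."""
    centre = sum(sample) / len(sample)
    return sum(abs(x - centre) for x in sample) / len(sample)


sample = [random.gauss(0.0, SIGMA) for _ in range(200_000)]

# Equated directly to SIGMA this statistic is inconsistent: in large
# samples it tends to SIGMA * sqrt(2 / pi), about 0.798 * SIGMA.
inconsistent_estimate = mean_deviation(sample)

# Equated to the proper parametric function it is consistent.
consistent_estimate = inconsistent_estimate * sqrt(pi / 2)
```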

Consistent  statistics,  on  the  other  hand,  all  tend 
more  and  more  nearly  to  give  the  correct  values,  as 
the  sample  is  more  and  more  increased  ;  at  any  rate, 
if  they  tend  to  any  fixed  value  it  is  not  to  an  incorrect 
one.  In  the  simplest  cases,  with  which  we  shall  be 
concerned,  they  not  only  tend  to  give  the  correct 
value,  but  the  errors,  for  samples  of  a  given  size,  tend 
to  be  distributed  in  a  well-known  distribution  (of  which 
more  in  Chap.  III.)  known  as  the  Normal  Law  of 
Frequency  of  Error,  or  more  simply  as  the  normal 
distribution.  The  liability  to  error  may,  in  such  cases, 
be  expressed  by  calculating  the  mean  value  of  the 
squares  of  these  errors,  a  value  which  is  known  as  the 
variance  ;  and  in  the  class  of  cases  with  which  we  are 
concerned,  the  variance  falls  off  with  increasing 
samples,  in  inverse  proportion  to  the  number  in  the 
sample. 
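
That the variance of such a statistic falls off in inverse proportion to the number in the sample can be checked roughly by simulation. The sketch takes the mean of a normal sample as the statistic; the population, the two sample sizes, the number of repetitions, and the seed are all assumptions.

```python
import random

random.seed(7)  # fixed seed so that the sketch is repeatable

MU, SIGMA = 0.0, 1.0  # ASSUMED parameters of a normal population


def variance_of_mean(n, repeats=4000):
    """Mean value of the squared errors of the sample mean, for samples of n."""
    total = 0.0
    for _ in range(repeats):
        sample_mean = sum(random.gauss(MU, SIGMA) for _ in range(n)) / n
        total += (sample_mean - MU) ** 2
    return total / repeats


variance_25 = variance_of_mean(25)
variance_100 = variance_of_mean(100)

# Quadrupling the sample should divide the variance by about four.
ratio = variance_25 / variance_100
```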

Now,  for  the  purpose  of  estimating  any  parameter, 
it  is  usually  possible  to  invent  any  number  of  statistics 
which  shall  be  consistent  in  the  sense  defined  above, 
and  each  of  which  has  in  large  samples  a  variance
falling  off  inversely  with  the  size  of  the  sample.  But
for  large  samples  of  a  fixed  size,  the  variance  of  these 
different  statistics  will  generally  be  different.  Conse- 
quently a  special  importance  belongs  to  a  smaller 
group  of  statistics,  the  error  distributions  of  which  tend 
to  the  normal  distribution,  as  the  sample  is  increased, 
with  the  least  possible  variance.  We  may  thus  separate 
off  from  the  general  body  of  consistent  statistics  a 
group  of  especial  value,  and  these  are  known  as 
efficient  statistics. 

The  reason  for  this  term  may  be  made  apparent  by 
an  example.  If  from  a  large  sample  of  (say)  1000 
observations  we  calculate  an  efficient  statistic,  A,  and 
a  second  consistent  statistic,  B,  having  twice  the 
variance  of  A,  then  B  will  be  a  valid  estimate  of  the 
required  parameter,  but  one  definitely  inferior  to  A 
in  its  accuracy.  Using  the  statistic  B,  a  sample  of 
2000  values  would  be  required  to  obtain  as  good  an 
estimate  as  is  obtained  by  using  the  statistic  A  from 
a  sample  of  1000  values.  We  may  say,  in  this  sense, 
that  the  statistic  B  makes  use  of  50  per  cent  of  the 
relevant  information  available  in  the  observations  ; 
or,  briefly,  that  its  efficiency  is  50  per  cent.  The  term 
"  efficient  "  in  its  absolute  sense  is  reserved  for 
statistics  the  efficiency  of  which  is  100  per  cent. 
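
The arithmetic of the example, and a rough measurement of efficiency by simulation, may be sketched as follows. The comparison of the median with the mean of a normal sample is an illustration chosen here, not taken from the passage above; for that case the efficiency of the median in large samples is known to be 2/π, or about 64 per cent. The function names, sample size, and seed are assumptions.

```python
import random
from statistics import median

random.seed(1928)  # fixed seed so that the sketch is repeatable


def equivalent_sample_size(n_efficient, efficiency):
    """Observations needed by a less efficient statistic to match an
    efficient one calculated from n_efficient observations."""
    return n_efficient / efficiency


def sampling_variances(n, repeats=4000):
    """Mean squared errors of the mean and of the median, for samples of n
    from a normal population with mean 0 and s.d. 1 (ASSUMED)."""
    squares_mean = squares_median = 0.0
    for _ in range(repeats):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        squares_mean += (sum(sample) / n) ** 2
        squares_median += median(sample) ** 2
    return squares_mean / repeats, squares_median / repeats


# The case in the text: efficiency 50 per cent, 1000 observations.
needed = equivalent_sample_size(1000, 0.5)

# Efficiency of the median relative to the mean, estimated by simulation.
variance_mean, variance_median = sampling_variances(101)
median_efficiency = variance_mean / variance_median
```

Here `needed` reproduces the 2000 observations required by statistic B, and `median_efficiency` is the ratio of the two variances, which is how the efficiency of a statistic is reckoned.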

Statistics  having  efficiency  less  than  100  per  cent 
may  be  legitimately  used  for  many  purposes.  It  is 
conceivable,  for  example,  that  it  might  in  some  cases  be 
less  laborious  to  increase  the  number  of  observations 
than  to  apply  a  more  elaborate  method  of  calculation 
to  the  results.     It  may  often  happen  that  an  inefficient 
statistic  is  accurate  enough  to  answer  the  particular
questions  at  issue.  There  is,  however,  one  limitation 
to  the  legitimate  use  of  inefficient  statistics  which 
should  be  noted  in  advance.  If  we  are  to  make 
accurate  tests  of  goodness  of  fit,  the  methods  of  fitting 
employed  must  not  introduce  errors  of  fitting  compar- 
able to  the  errors  of  random  sampling ;  when  this 
requirement  is  investigated,  it  appears  that  when  tests 
of  goodness  of  fit  are  required,  the  statistics  employed 
in  fitting  must  be  not  only  consistent,  but  must  be  of 
100  per  cent  efficiency.  This  is  a  very  serious  limitation
to  the  use  of  inefficient  statistics,  since  in  the
examination  of  any  body  of  data  it  is  desirable  to 
be  able  at  any  time  to  test  the  validity  of  one  or 
more  of  the  provisional  assumptions  which  have 
been  made. 

Numerous  examples  of  the  calculation  of  statistics 
will  be  given  in  the  following  chapters,  and  in  these 
illustrations  of  method,  efficient  statistics  have  been 
chosen.  The  discovery  of  efficient  statistics  in  new 
types  of  problem  may  require  some  mathematical 
investigation.  The  researches  of  the  author  have  led 
him  to  the  conclusion  that  an  efficient  statistic  can 
in  all  cases  be  found  by  the  Method  of  Maximum 
Likelihood  ;  that  is,  by  choosing  statistics  so  that  the 
estimated  population  should  be  that  for  which  the 
likelihood  is  greatest.  In  view  of  the  mathematical 
difficulty  of  some  of  the  problems  which  arise  it  is  also 
useful  to  know  that  approximations  to  the  maximum 
likelihood  solution  are  also  in  most  cases  efficient 
statistics.     A  simple  example  of  the  application  of  the 
method  of  maximum  likelihood,  and  other  methods,
to  a  genetical  problem  is  developed  in  the  final  chapter. 
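
As a minimal sketch of what "choosing statistics so that the likelihood is greatest" means in practice, the example below estimates a single proportion from hypothetical counts (37 successes in 100 trials) by maximising the binomial log-likelihood numerically. The data, the function names and the golden-section search are assumptions made for illustration; this is not the genetical example of the final chapter. In this simple case the maximum is known to lie at the observed proportion, which provides a check on the search.

```python
import math


def log_likelihood(p, successes, trials):
    """Binomial log-likelihood of p, omitting the constant term."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def maximise(f, lo, hi, tol=1e-7):
    """Golden-section search for the maximum of a unimodal function."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            a = c       # the maximum lies to the right of c
        else:
            b = d       # the maximum lies to the left of d
    return (a + b) / 2.0


successes, trials = 37, 100     # hypothetical counts
p_hat = maximise(lambda p: log_likelihood(p, successes, trials), 1e-6, 1.0 - 1e-6)

print(f"maximum likelihood estimate: {p_hat:.4f}")
print(f"observed proportion:         {successes / trials:.4f}")
```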

For  practical  purposes  it  is  not  generally  necessary 
to  press  refinement  of  methods  further  than  the  stipula- 
tion that  the  statistics  used  should  be  efficient.  With 
large  samples  it  may  be  shown  that  all  efficient 
statistics  tend  to  equivalence,  so  that  little  incon- 
venience arises  from  diversity  of  practice.  There  is, 
however,  one  class  of  statistics,  including  some  of  the 
most  frequently  recurring  examples,  which  is  of  theo- 
retical interest  for  possessing  the  remarkable  property 
that,  even  in  small  samples,  a  statistic  of  this  class 
alone  includes  the  whole  of  the  relevant  information 
which  the  observations  contain.  Such  statistics  are 
distinguished  by  the  term  sufficient,  and,  in  the  use 
of  small  samples,  sufficient  statistics,  when  they  exist, 
are  definitely  superior  to  other  efficient  statistics. 
Examples  of  sufficient  statistics  are  the  arithmetic 
mean  of  samples  from  the  normal  distribution,  or  from 
the  Poisson  series  ;  it  is  the  fact  of  providing  sufficient 
statistics  for  these  two  important  types  of  distribution 
which  gives  to  the  arithmetic  mean  its  theoretical 
importance.  The  method  of  maximum  likelihood 
leads  to  these  sufficient  statistics  where  they  exist. 
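
The property of sufficiency may be illustrated for the Poisson series. In the sketch below, two hypothetical samples of the same size and the same total (and hence the same mean) are compared: the logarithm of the ratio of their likelihoods at any two values of the parameter is identical, so that nothing beyond the mean bears on the choice between parameter values. The samples and the symbol m for the Poisson parameter are assumptions made for the example.

```python
import math


def poisson_log_likelihood(m, sample):
    """Log-likelihood of the Poisson parameter m for a sample of counts."""
    return sum(-m + x * math.log(m) - math.lgamma(x + 1) for x in sample)


def log_likelihood_ratio(sample, m1, m2):
    """Log of the ratio of the likelihoods at m1 and m2."""
    return poisson_log_likelihood(m1, sample) - poisson_log_likelihood(m2, sample)


sample_a = [0, 1, 2, 3, 4]      # hypothetical counts, total 10
sample_b = [2, 2, 2, 2, 2]      # a different sample with the same total

ratio_a = log_likelihood_ratio(sample_a, 1.5, 2.5)
ratio_b = log_likelihood_ratio(sample_b, 1.5, 2.5)

print(ratio_a, ratio_b)         # the two values agree
```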

While  diversity  of  practice  within  the  limits  of 
efficient  statistics  will  not  with  large  samples  lead  to 
inconsistencies,  it  is,  of  course,  of  importance  in  all 
cases  to  distinguish  clearly  the  parameter  of  the  popula- 
tion, of  which  it  is  desired  to  estimate  the  value,  from 
the  actual  statistic  employed  as  an  estimate  of  its 
value  ;    and  to  inform  the  reader  by  which  of  the 
considerable  variety  of  processes  which  exist  for  the
purpose  the  estimate  was  actually  obtained. 

4.  Scope  of  this  Book 

The  prime  object  of  this  book  is  to  put  into  the 
hands  of  research  workers,  and  especially  of  biologists, 
the  means  of  applying  statistical  tests  accurately  to 
numerical  data  accumulated  in  their  own  laboratories 
or  available  in  the  literature.  Such  tests  are  the  result 
of  solutions  of  problems  of  distribution,  most  of  which 
are  but  recent  additions  to  our  knowledge  and  have 
so  far  only  appeared  in  specialised  mathematical 
papers.  The  mathematical  complexity  of  these  prob- 
lems has  made  it  seem  undesirable  to  do  more  than 
(i.)  to  indicate  the  kind  of  problem  in  question,  (ii.) 
to  give  numerical  illustrations  by  which  the  whole 
process  may  be  checked,  (iii.)  to  provide  numerical 
tables  by  means  of  which  the  tests  may  be  made 
without  the  evaluation  of  complicated  algebraical 
expressions. 

It  would  have  been  impossible  to  give  methods 
suitable  for  the  great  variety  of  kinds  of  tests  which 
are  required  but  for  the  unforeseen  circumstances  that 
each  mathematical  solution  appears  again  and  again 
in  questions  which  at  first  sight  appeared  to  be  quite 
distinct.  For  example,  Pearson's  solution  in  1900  of 
the  distribution  of  χ²  is  in  reality  equivalent  to  the
distribution  of  the  variance  as  estimated  from  normal 
samples,  of  which  the  solution  was  not  given  until 
1908,  and  then  quite  tentatively,  and  without  com- 
plete mathematical  proof,  by  "  Student."     The  same 
distribution  was  found  by  the  author  for  the  index  of
dispersion  derived  from  small  samples  from  a  Poisson 
series.  What  is  even  more  remarkable  is  that,  al- 
though Pearson's  paper  of  1900  contained  a  serious 
error,  which  vitiated  most  of  the  tests  of  goodness  of 
fit  made  by  this  method  until  1921,  yet  the  correction 
of  this  error  leaves  the  form  of  the  distribution  un- 
changed, and  only  requires  that  some  few  units  should 
be  deducted  from  one  of  the  variables  with  which  the 
table  of  χ²  is  entered.
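
The equivalence between the distribution of χ² and that of the variance estimated from normal samples can be checked roughly by simulation. The sketch below assumes samples of 6 from a normal population with standard deviation 3; these figures, the seed and the number of trials are arbitrary. The quantity (n − 1)s²/σ² should then behave as χ² with n − 1 = 5 degrees of freedom, of which the mean is 5 and the variance 10.

```python
import random
import statistics

rng = random.Random(2)      # fixed seed so that the run is repeatable
n = 6                       # size of each sample (assumed)
sigma = 3.0                 # population standard deviation (assumed)
trials = 20000

values = []
for _ in range(trials):
    sample = [rng.gauss(10.0, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)            # estimate with divisor n - 1
    values.append((n - 1) * s2 / sigma ** 2)

mean_value = statistics.fmean(values)           # expected to be near n - 1
var_value = statistics.pvariance(values)        # expected to be near 2(n - 1)

print(f"mean {mean_value:.2f}, variance {var_value:.2f}")
```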

It  is  equally  fortunate  that  the  distribution  of  t,
first  established  by  "  Student  "  in  1908,  in  his  study 
of  the  probable  error  of  the  mean,  should  be  applicable, 
not  only  to  the  case  there  treated,  but  to  the  more 
complex,  but  even  more  frequently  needed  problem 
of  the  comparison  of  two  mean  values.  It  further 
provides  an  exact  solution  of  the  sampling  errors  of  the 
enormously  wide  class  of  statistics  known  as  regression 
coefficients. 
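
For the comparison of two mean values, the usual form of the statistic may be sketched as follows. The function name and the data are hypothetical, and the sketch assumes that the two samples are independent and share a common variance, which is estimated by pooling the sums of squares. The value of t is then referred to the table with the number of degrees of freedom returned alongside it.

```python
import math
import statistics


def two_sample_t(a, b):
    """Return (t, degrees of freedom) for the difference of two means."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled estimate of the common variance.
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / df
    standard_error = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    t = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    return t, df


first = [1, 2, 3, 4, 5]         # hypothetical observations
second = [2, 3, 4, 5, 6]

t, df = two_sample_t(first, second)
print(f"t = {t:.3f} with {df} degrees of freedom")
```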

In  studying  the  exact  theoretical  distributions  in 
a  number  of  other  problems,  such  as  those  presented 
by  intraclass  correlations,  the  goodness  of  fit  of  regres- 
sion lines,  the  correlation  ratio,  and  the  multiple  cor- 
relation coefficient,  the  author  has  been  led  repeatedly 
to  a  third  distribution,  which  may  be  called  the  distri- 
bution of  z,  and  which  is  intimately  related  to,  and 
indeed  a  natural  extension  of,  the  distributions  found 
by  Pearson  and  "  Student."  It  has  thus  been  possible 
to  classify  the  necessary  distributions,  covering  a  very 
great  variety  of  cases,  under  these  three  main  groups  ; 
and,  what  is  equally  important,  to  make  some  provision 
for  the  need  for  numerical  values  by  means  of  a  few
tables  only. 

The  book  has  been  arranged  so  that  the  student 
may  make  acquaintance  with  these  three  main 
distributions  in  a  logical  order,  and  proceeding  from 
more  simple  to  more  complex  cases.  Methods  de- 
veloped in  later  chapters  are  frequently  seen  to  be 
generalisations  of  simpler  methods  developed  previ- 
ously. Studying  the  work  methodically  as  a  connected 
treatise,  the  student  will,  it  is  hoped,  not  miss  the 
fundamental  unity  of  treatment  under  which  such 
very  varied  material  has  been  brought  together  ; 
and  will  prepare  himself  to  deal  competently  and  with 
exactitude  with  the  many  analogous  problems,  which 
cannot  be  individually  exemplified.  On  the  other 
hand,  it  is  recognised  that  many  will  wish  to  use  the 
book  for  laboratory  reference,  and  not  as  a  connected 
course  of  study.  This  use  would  seem  desirable 
only  if  the  reader  will  be  at  the  pains  to  work 
through,  in  all  numerical  detail,  one  or  more  of  the 
appropriate  examples,  so  as  to  assure  himself,  not 
only  that  his  data  are  appropriate  for  a  parallel  treat- 
ment, but  that  he  has  obtained  some  critical  grasp  of 
the  meaning  to  be  attached  to  the  processes  and 
results. 

It  is  necessary  to  anticipate  one  criticism,  namely, 
that  in  an  elementary  book,  without  mathematical 
proofs,  and  designed  for  readers  without  special  mathe- 
matical training,  so  much  has  been  included  which 
from  the  teacher's  point  of  view  is  advanced  ;  and 
indeed   much   that    has   not    previously   appeared    in 
print.     By  way  of  apology  the  author  would  like  to
put  forward  the  following  considerations. 

(1)  For  non-mathematical  readers,  numerical
tables  are  in  any  case  necessary ;  accurate  tables  are 
no  more  difficult  to  use,  though  more  laborious  to 
calculate,  than  inaccurate  tables  embodying  the 
current  approximations. 

(2)  The  process  of  calculating  a  probable  error 
from  one  of  the  established  formulae  gives  no  real 
insight  into  the  random  sampling  distribution,  and  can 
only  supply  a  test  of  significance  by  the  aid  of  a  table 
of  deviations  of  the  normal  curve,  and  on  the  assump- 
tion that  the  distribution  is  in  fact  very  nearly  normal. 
Whether  this  procedure  should,  or  should  not,  be  used 
must  be  decided,  not  by  the  mathematical  attainments 
of  the  investigator,  but  by  discovering  whether  it  will 
or  will  not  give  a  sufficiently  accurate  answer.  The 
fact  that  such  a  process  has  been  used  successfully  by 
eminent  mathematicians  in  analysing  very  extensive 
and  important  material  does  not  imply  that  it  is 
sufficiently  accurate  for  the  laboratory  worker  anxious 
to  draw  correct  conclusions  from  a  small  group  of 
perhaps  preliminary  observations. 

(3)  The  exact  distributions,  with  the  use  of  which 
this  book  is  chiefly  concerned,  have  been  in  fact 
developed  in  response  to  the  practical  problems  arising 
in  biological  and  agricultural  research  ;  this  is  true  not 
only  of  the  author's  own  contribution  to  the  subject, 
but  from  the  beginning  of  the  critical  examination  of 
statistical  distributions  in  "  Student's  "  paper  of  1908. 

The    greater    part    of   the    book    is    occupied    by 
numerical  examples  ;  and  these  perhaps  could  with
advantage  have  been  increased  in  number.  In  choos- 
ing them  it  has  appeared  to  the  author  a  hopeless  task 
to  attempt  to  exemplify  the  great  variety  of  subject 
matter  to  which  these  processes  may  be  usefully 
applied.  There  are  no  examples  from  astronomical 
statistics,  in  which  important  work  has  been  done  in 
recent  years,  few  from  social  studies,  and  the  biological 
applications  are  scattered  unsystematically.  The 
examples  have  rather  been  chosen  each  to  exemplify 
a  particular  process,  and  seldom  on  account  of  the 
importance  of  the  data  used,  or  even  of  similar  ex- 
aminations of  analogous  data.  By  a  study  of  the 
processes  exemplified,  the  student  should  be  able  to 
ascertain  to  what  questions,  in  his  own  material,  such 
processes  are  able  to  give  a  definite  answer  ;  and, 
equally  important,  what  further  observations  would  be 
necessary  to  settle  other  outstanding  questions.  In 
conformity  with  the  purpose  of  the  examples  the 
reader  should  remember  that  they  do  not  pretend  to 
be  discussions  of  general  scientific  questions,  which 
would  require  the  examination  of  much  more  extended 
data,  and  of  other  evidence,  but  are  solely  concerned 
with  the  critical  examination  of  the  particular  batch  of 
data  presented. 

5.  Mathematical  Tables 

The  tables  of  distributions  supplied  at  the  ends  of 
several  chapters  form  a  part  essential  to  the  use  of  the 
book. 

Tables   I.  and  II. — The  importance  of  the  normal 
distribution  has  been  recognised  at  least  from  the  time
of  Laplace.  (The  formula  has  even  been  traced 
back  to  a  little-known  work  by  De  Moivre  of  1733.) 
Numerous  tables  have  given  in  one  form  or  another 
the  relation  between  the  deviation,  and  the  probability 
of  a  greater  deviation.  Important  sources  for  these 
values  are 

J.  Burgess  (1895),  Trans. Roy.  Soc.  Edin.,  XXXIX. 
pp.  257-321; 

J.  W.  L.  Glaisher  (1871),  Phil.  Mag.,  Series  IV.
Vol.  XLII.  p.  436. 

The  very  various  forms  in  which  this  relation  has 
been  tabulated  add  considerably  to  the  labour  of
practical  applications.  The  form  which  we  have 
adopted  for  this,  and  for  the  other  tables,  has  been 
used  for  the  normal  distribution  by 

F.  Galton  and  W.  F.  Sheppard  (1907),  Biometrika,  V.  p.  405 ;

T.  L.  Kelley,  Statistical  Method,  pp.  373-385; 
both  of  which  are  valuable  tables,  on  a  more  extensive 
scale  than  Table  I.  In  Table  II.  we  have  given  the 
normal  deviations  corresponding  to  very  high  odds. 
It  should  be  remembered  that  even  slight  departures 
from  the  normal  distribution  will  much  affect  these 
very  small  probabilities,  and  that  we  seldom  can  be 
certain,  in  any  particular  case,  that  these  high  odds 
will  be  accurate.  The  table  illustrates  the  general 
fact  that  the  significance  in  the  normal  distribution  of 
deviations  exceeding  four  times  the  standard  deviation 
is  extremely  pronounced. 
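
The relation tabulated in Tables I. and II., between a deviation and the probability of a greater deviation, can be sketched with the complementary error function. The function names below are assumptions, and the sketch takes the probability to cover both tails, that is, deviations of either sign. The inverse relation uses the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist


def prob_greater_deviation(x):
    """Probability that a normal deviate exceeds x standard deviations
    in absolute value (both tails together)."""
    return math.erfc(x / math.sqrt(2.0))


def deviation_for_probability(p):
    """Deviation, in standard deviations, exceeded with total probability p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


for x in (1.0, 2.0, 3.0, 4.0):
    print(f"deviation {x:.0f}: P = {prob_greater_deviation(x):.3g}")

print(f"P = 0.05 corresponds to a deviation of {deviation_for_probability(0.05):.4f}")
```

The rapid fall of the probability beyond four times the standard deviation is evident from the last line of the loop.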

Table  III.;  table  of  χ². — Tables  of  the  value
of  P  for  different  values  of  χ²  and  n′,  were  given  by

K.  Pearson  (1900),  Phil.  Mag.,  Series  V.  Vol.  L.  p.  175 ;

W.  P.  Elderton  (1902),  Biometrika,  I.  pp.  155-163 ;

the  same  relationship  in  a  much  modified  form  underlies

K.  Pearson  (1922),  Tables  of  the  incomplete  Γ-function.

Table  III.  gives  the  values  of  χ²  for  different
values  of  P  and  n,  in  a  form  designed  for  rapid 
laboratory  use,  and  with  a  view  to  covering  in  suffi- 
cient detail  the  range  of  values  actually  occurring  in 
practice.  For  higher  values  of  n  the  test  is  supple- 
mented by  an  easily  calculated  approximate  test. 

Table  IV.  ;  table  of  t. — Tables  of  the  same
distribution  as  that  of  t  were  originally  given  by 
"  Student  "  in  1908  and  1917. 

The  same  author  has  since  given  a  much  extended 
table  of  the  probability  that  t  shall  be  algebraically  less 
than  any  given  value,  for  the  first  twenty  values  of  n. 
This  table  is  supplemented  by  a  rapid  method  of 
calculating  the  probability  for  higher  values  of  n. 

"  Student  "  (1925),  Metron.  Vol.  V.  No.  3,  pp. 
113-120. 

For  the  purposes  of  the  present  book  we  require 
the  values  of  t  corresponding  to  given  values  of  P
and  n. 

Table  V.A  gives  the  values  of  the  correlation 
coefficient  for  different  levels  of  significance,  according 
to  the  extent  of  the  sample  upon  which  the  value  is 
based.      From    this   table   the   reader   may   see   at   a 
glance  whether  or  not  any  correlation  obtained  may
be  regarded  as  significant,  for  samples  up  to  100  pairs 
of  observations. 

Table  V.B  gives  the  values  of  the  well-known 
mathematical  function,  the  hyperbolic  tangent,  which 
we  have  introduced  in  the  calculation  of  sampling  errors 
of  the  correlation  coefficient.  The  function  is  simply 
related  to  the  logarithmic  and  exponential  functions, 
and  may  be  found  quite  easily  by  such  a  convenient 
table  of  natural  logarithms  as  is  given  in 

J.  T.  Bottomley,  Four-figure  Mathematical  Tables, 
while  the  hyperbolic  tangent  and  its  inverse  appear  in 

W.  Hall,  Four  figure  Tables  and  Constants. 
A  table  of  natural  logarithms  is  in  other  ways  a 
necessary  supplement  in  using  this  book,  as  in  other 
laboratory  calculations.  Tables  of  the  inverse  hyper- 
bolic tangent  for  correlational  work  have  been  pre- 
viously given  by 

R.  A.   Fisher  (1921),  Metron.  Vol.   I.  No.  4,  pp. 
26-27. 
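
The simple relation of the hyperbolic tangent to the logarithmic and exponential functions may be written out as a short sketch. The function names are assumptions; the symbols r for the correlation coefficient and z for its transformed value follow common usage and are not defined in this passage.

```python
import math


def inverse_hyperbolic_tangent(r):
    """z such that r = tanh z, found from a natural logarithm."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def hyperbolic_tangent(z):
    """tanh z expressed through the exponential function."""
    e2z = math.exp(2.0 * z)
    return (e2z - 1.0) / (e2z + 1.0)


r = 0.6                                     # hypothetical correlation
z = inverse_hyperbolic_tangent(r)
print(f"r = {r}, z = {z:.4f}, back again r = {hyperbolic_tangent(z):.4f}")
```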

Table VI. ; table of z. — Tests of significance involving
the use of z occur so frequently, and their
practical use has extended so rapidly, that it has been
necessary to supply a much larger table of this function
than that given in the first edition. The present table
gives the 5 per cent and the 1 per cent points for
320 pairs of values of n₁ and n₂, and the same points
can be found for other values with comparative ease.
When n₁ and n₂ are large, fairly satisfactory approximate
formulae are available to supplement the table
(p. 201).
Since the z test of significance includes as special
cases those based upon χ² and t, readers who prefer
to use a uniform method of procedure may employ
the table of z for all cases, to which the two previous
tests are applicable. The fuller tables for χ² and t
will, however, still be found useful for numerous
special cases.
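
The sense in which χ² and t are special cases of z may be sketched as follows, on the assumption that z is defined as half the natural logarithm of the ratio of two independent estimates of variance, based on n₁ and n₂ degrees of freedom. With n₁ = 1 the ratio is the square of t, and with n₂ infinite it is χ² divided by n₁. The function names are hypothetical.

```python
import math


def z_from_variances(estimate_1, estimate_2):
    """z as half the natural log of the ratio of two variance estimates."""
    return 0.5 * math.log(estimate_1 / estimate_2)


def z_from_t(t):
    """Equivalent z when n1 = 1: the variance ratio is t squared."""
    return math.log(abs(t))


def z_from_chi_square(chi_square, n1):
    """Equivalent z when n2 is infinite: the variance ratio is chi-square / n1."""
    return 0.5 * math.log(chi_square / n1)


print(f"t = 2.228        gives z = {z_from_t(2.228):.4f}")
print(f"chi-square 11.07 on 5 degrees of freedom gives z = {z_from_chi_square(11.07, 5):.4f}")
```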


II 

DIAGRAMS 

7.  The  preliminary  examination  of  most  data  is 
facilitated  by  the  use  of  diagrams.  Diagrams  prove 
nothing,  but  bring  outstanding  features  readily  to  the 
eye  ;  they  are  therefore  no  substitute  for  such  critical 
tests  as  may  be  applied  to  the  data,  but  are  valuable  in 
suggesting  such  tests,  and  in  explaining  the  conclusions 
founded  upon  them. 

8.  Time  Diagrams,  Growth  Rate  and  Relative 
Growth  Rate 

The type of diagram in most frequent use consists
in plotting the values of a variable, such as the weight
of an animal or of a sample of plants against its age,
or the size of a population at successive intervals of
time. Distinction should be drawn between those
cases in which the same group of animals, as in a
feeding experiment, is weighed at successive intervals
of time, and the cases, more characteristic of plant
physiology, in which the same individuals cannot
be used twice, but a parallel sample is taken at each
age. The same distinction occurs in counts of micro-organisms
between cases in which counts are made
from  samples  of  the  same  culture,  or  from  samples  of 
parallel  cultures.  If  it  is  of  importance  to  obtain  the 
general  form  of  the  growth  curve,  the  second  method 
has  the  advantage  that  any  deviation  from  the  expected 
curve  may  be  confirmed  from  independent  evidence 
at  the  next  measurement,  whereas  using  the  same 
material  no  such  independent  confirmation  is  obtain- 
able. On  the  other  hand,  if  interest  centres  on  the 
growth  rate,  there  is  an  advantage  in  using  the  same 
material,  for  only  so  are  actual  increases  in  weight 
measurable.  Both  aspects  of  the  difficulty  can  be  got 
over  only  by  replicating  the  observations ;  by  carry- 
ing out  measurements  on  a  number  of  animals  under 
parallel  treatment  it  is  possible  to  test,  from  the 
individual  weights,  though  not  from  the  means, 
whether  their  growth  curve  corresponds  with  an 
assigned  theoretical  course  of  development,  or  differs 
significantly  from  it  or  from  a  series  differently  treated. 
Equally,  if  a  number  of  plants  from  each  sample  are 
weighed  individually,  growth  rates  may  be  obtained 
with  known  probable  errors,  and  so  may  be  used  for 
critical  comparisons.  Care  should  of  course  be  taken 
that  each  is  strictly  a  random  sample. 

Fig. 1 represents the growth of a baby weighed
to the nearest ounce at weekly intervals from birth.
Table 1 indicates the calculation from these data of
the absolute growth rate in ounces per day and the
relative growth rate per day. The absolute growth
rates, representing the average actual rates at which
substance is added during each period, are found by
subtracting from each value that previously recorded,
and dividing by the length of the period. The
relative growth rates measure the rate of increase not
only per unit of time, but per unit of weight already
attained ; using the mathematical fact, that

$$\frac{1}{m}\frac{dm}{dt} = \frac{d}{dt}(\log m),$$

it is seen that the true average value of the relative
growth rate for any period is obtained from the natural
logarithms of the successive weights, just as the actual
rates of increase are from the weights themselves.
Such relative rates of increase are conveniently
multiplied by 100, and thereby expressed as the
percentage rate of increase per day. If these percentage
rates of increase had been calculated on the
principle of simple interest, by dividing the actual
increase by the weight at the beginning of the period,
somewhat higher values would have been obtained ;
the reason for this is that the actual weight of the baby
at any time during each period is usually somewhat
higher than its weight at the beginning. The error
introduced by the simple interest formula becomes
exceedingly great when the percentage increases
between successive weighings are large.

                                   TABLE 1

Age in   Weight in   Increase.   Growth Rate   Natural Log.   Increase.   Relative Growth
Weeks.   Ounces.                 per Day       of Weight.                 Rate per cent
                                 (Oz.).                                   per Day.

   0       110                                    .0953
                         4          .57                          .0357          .51
   1       114                                    .1310
                        14         2.00                          .1159         1.66
   2       128                                    .2469
                        19         2.71                          .1384         1.98
   3       147                                    .3853
                        16         2.29                          .1033         1.47
   4       163                                    .4886
                         9         1.29                          .0537          .77
   5       172                                    .5423
                        14         2.00                          .0783         1.12
   6       186                                    .6206
                        12         1.71                          .0625          .89
   7       198                                    .6831
                        10         1.43                          .0493          .70
   8       208                                    .7324
                         5          .71                          .0237          .34
   9       213                                    .7561
                        19         2.71                          .0855         1.22
  10       232                                    .8416
                         8         1.14                          .0339          .48
  11       240                                    .8755
                        14         2.00                          .0567          .81
  12       254                                    .9322
                         7         1.00                          .0272          .39
  13       261                                    .9594
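
The calculation of the absolute and relative growth rates can be sketched as follows, using the fourteen weekly weights of Table 1. The variable names are assumptions. The sketch also computes the "simple interest" percentage for comparison, and takes each period to be seven days. Values computed directly from the weights may differ in the last figure from those of the table, which are derived from four-figure logarithms; the tabulated logarithms themselves agree with the natural logarithms of the weights expressed in units of 100 ounces, a change of unit which does not affect the differences.

```python
import math

# Weekly weights in ounces, from birth to 13 weeks (Table 1).
weights = [110, 114, 128, 147, 163, 172, 186, 198, 208, 213, 232, 240, 254, 261]
days_per_period = 7

rows = []
for before, after in zip(weights, weights[1:]):
    # Absolute growth rate: ounces added per day.
    absolute = (after - before) / days_per_period
    # Relative growth rate: increase of the natural log, per cent per day.
    relative = 100.0 * (math.log(after) - math.log(before)) / days_per_period
    # Simple-interest rate: increase divided by the initial weight.
    simple = 100.0 * (after - before) / before / days_per_period
    rows.append((absolute, relative, simple))

for week, (absolute, relative, simple) in enumerate(rows):
    print(f"week {week:2d}-{week + 1:2d}: {absolute:5.2f} oz/day, "
          f"{relative:5.2f} per cent/day (simple interest {simple:5.2f})")
```

In every period the simple-interest figure exceeds the relative growth rate, as the text leads one to expect.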

Fig.  1  A  shows  the  course  of  the  increase  in  absolute 
weight ;  the  average  slope  of  such  a  diagram  shows  the 
absolute  rate  of  increase.  In  this  diagram  the  points 
fall  approximately  on  a  straight  line,  showing  that  the 
absolute  rate  of  increase  was  nearly  constant  at  about 
1.66  oz.  per  diem.  Fig.  1  B  shows  the  course  of  the
increase  in  the  natural  logarithm  of  the  weight ;  the 
slope  at  any  point  shows  the  relative  rate  of  increase, 
which,  apart  from  the  first  week,  falls  off  perceptibly 
with  increasing  age.  The  features  of  such  curves  are 
best  brought  out  if  the  scales  of  the  two  axes  are  so 
chosen  that  the  graph  makes  with  them  approximately 
equal  angles  ;  with  nearly  vertical,  or  nearly  horizontal 
lines,  changes  in  the  slope  are  not  so  readily  perceived. 
A  rapid  and  convenient  way  of  displaying  the  line
of  increase  of  the  logarithm  is  afforded  by  the  use  of 
graph  paper  in  which  the  horizontal  rulings  are  spaced 
on  a  logarithmic  scale,  with  the  actual  values  indicated 
in  the  margin  (see  Fig.  5).  The  horizontal  scale  can 
then  be  adjusted  to  give  the  line  an  appropriate  slope. 
This  method  avoids  the  use  of  a  logarithm  table, 
which,  however,  will  still  be  required  if  the  values  of 
the  relative  rate  of  increase  are  needed. 

In  making  a  rough  examination  of  the  agreement 
of  the  observations  with  any  law  of  increase,  it  is 
desirable  so  to  manipulate  the  variables  that  the  law 
to  be  tested  will  be  represented  by  a  straight  line. 
Thus  Fig.  1  A  is  suitable  for  a  rough  test  of  the  law 
that  the  absolute  rate  of  increase  is  constant ;  if  it  were 
suggested  that  the  relative  rate  of  increase  were  con- 
stant, Fig.  1  B  would  show  clearly  that  this  was  not  so. 
With  other  hypothetical  growth  curves  other  trans- 
formations may  be  used  ;  for  example,  in  the  so-called 
"  autocatalytic  "  curve  the  relative  growth  rate  falls 
off  in  proportion  to  the  actual  weight  attained  at  any 
time.  If,  therefore,  the  relative  growth  rate  be  plotted 
against  the  actual  weight,  the  points  should  fall  on  a 
straight  line  if  the  "  autocatalytic  "  curve  fits  the  facts. 
For  this  purpose  it  is  convenient  to  plot  against  each 
observed  weight  the  mean  of  the  two  adjacent  relative 
growth  rates.  To  do  this  for  the  above  data  for  the 
growth  of  an  infant  may  be  left  as  an  exercise  to  the 
student  ;  twelve  points  will  be  available  for  weights 
114  to  254  ounces.  The  relative  growth  rates,  even 
after  averaging  adjacent  pairs,  will  be  very  irregular, 
so  that  no  clear  indications  will  be  found  from  these
data.  If  a  straight  line  is  found  to  fit  the  data,  it 
should  be  produced  to  meet  the  horizontal  axis  to  find 
the  weight  at  which  growth  ceases. 
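
The exercise proposed above may be sketched as follows: the relative growth rates for the thirteen periods are averaged in adjacent pairs, giving twelve points to set against the weights 114 to 254 ounces. The straight line fitted here by least squares, and its extension to the horizontal axis, are an assumed way of carrying the test through numerically; since the points are very irregular, the result should be read only as a rough indication.

```python
import math

weights = [110, 114, 128, 147, 163, 172, 186, 198, 208, 213, 232, 240, 254, 261]

# Relative growth rate, per cent per day, for each of the 13 weekly periods.
rates = [100.0 * (math.log(b) - math.log(a)) / 7 for a, b in zip(weights, weights[1:])]

# Mean of the two adjacent rates, plotted against each intermediate weight.
mean_rates = [(r0 + r1) / 2.0 for r0, r1 in zip(rates, rates[1:])]
points = list(zip(weights[1:-1], mean_rates))

# Least-squares straight line through the twelve points.
xs = [w for w, _ in points]
ys = [r for _, r in points]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Weight at which the fitted line meets the horizontal axis.
limiting_weight = -intercept / slope

for w, r in points:
    print(f"{w:4d} oz  {r:5.2f} per cent per day")
print(f"fitted line reaches zero at about {limiting_weight:.0f} oz")
```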

9.  Correlation  Diagrams 

Although  most  investigators  make  free  use  of 
diagrams  in  which  an  uncontrolled  variable  is  plotted 
against  the  time,  or  against  some  controlled  factor  such 
as  concentration  of  solution,  or  temperature,  much 
more  use  might  be  made  of  correlation  diagrams  in 
which  one  uncontrolled  factor  is  plotted  against 
another.  When  this  is  done  as  a  dot  diagram,  a 
number  of  dots  are  obtained  each  representing  a  single 
experiment,  or  pair  of  observations,  and  it  is  usually 
clear  from  such  a  diagram  whether  or  not  any  close 
connexion  exists  between  the  variables.  When  the 
observations  are  few  a  dot  diagram  will  often  tell  us 
whether  or  not  it  is  worth  while  to  accumulate  observa- 
tions of  the  same  sort  ;  the  range  and  extent  of  our 
experience  is  visible  at  a  glance  ;  and  associations  may 
be  revealed  which  are  worth  while  following  up. 

If  the  observations  are  so  numerous  that  the  dots 
cannot  be  clearly  distinguished,  it  is  best  to  divide  up 
the  diagram  into  squares,  recording  the  frequency  in 
each  ;  this  semi-diagrammatic  record  is  a  correlation 
table. 

Fig.  2  shows  in  a  dot  diagram  the  yields  obtained 
from  an  experimental  plot  of  wheat  (dunged  plot, 
Broadbalk  field,  Rothamsted)  in  years  with  different 
total rainfall. The plot was under uniform treatment
during the whole period 1854-1888 ; the 35 pairs
of observations, indicated by 35 dots, show well the
association of high yield with low rainfall. Even
when few observations are available a dot diagram may
suggest associations hitherto unsuspected, or what is
equally important, the absence of associations which
would have been confidently predicted. Their value
lies in giving a simple conspectus of the experience
hitherto gathered, and in bringing to the mind suggestions
which may be susceptible of more exact statistical
examination.

Fig. 2. — Wheat yield and rainfall for 35 years, 1854-1888

Instead of making a dot diagram the device is
sometimes adopted of arranging the values of one
variate in order of magnitude, and plotting the values
of a second variate in the same order. If the line
so obtained shows any perceptible slope, or general
trend, the variates are taken to be associated. Fig. 3
represents the line obtained for rainfall, when the
years are arranged in order of wheat yield. Such
diagrams are usually far less informative than the
dot diagram, and often conceal features of importance
brought out by the former. In addition the dot
diagram possesses the advantage that it is easily used
as a correlation table if the number of dots is small,
and easily transformed into one if the number of dots
is large.

Fig. 3. — Rainfall and yield of 35 years arranged in order of yield.

In  the  correlation  table  the  values  of  both  variates 
are  divided  into  classes,  and  the  class  intervals  should 
be  equal  for  all  values  of  the  same  variate.  Thus  we 
might  divide  the  value  for  the  yield  of  wheat  through- 
out at  intervals  of  one  bushel,  and  the  values  of  the 
rainfall  at  intervals  of  1  inch.  The  diagram  is  thus
divided  into  squares,  and  the  number  of  observations 
falling  into  each  square  is  counted  and  recorded.  The 
correlation  table  is  useful  for  three  distinct  purposes. 
It  affords  a  valuable  visual  representation  of  the  whole 
of  the  observations,  which  with  a  little  experience  is  as 
easy  to  comprehend  as  a  dot  diagram  ;  it  serves  as  a 
compact  record  of  extensive  data,  which,  as  far  as  the 
two  variates  are  concerned,  is  complete.  With  more 
than  two  variates  correlation  tables  may  be  given  for 
every  pair.  This  will  not  indeed  enable  the  reader  to 
reconstruct  the  original  data  in  its  entirety,  but  it  is  a 
fortunate  fact  that  for  the  great  majority  of  statistical 
purposes,  a  set  of  such  twofold  distributions  provides 
complete  information.  Original  data  involving  more 
than  two  variates  is  most  conveniently  recorded  for 
reference  on  cards,  each  case  being  given  a  separate 
card  with  the  several  variates  entered  in  corresponding 
positions upon them. The publication of such complete
data presents difficulties, but it is not yet sufficiently
realised how much of the essential information
can be presented in a compact form by means of
correlation tables. The third feature of value about
the correlation table is that the data so presented form
a  convenient  basis  for  the  immediate  application  of 
methods  of  statistical  reduction.  The  most  important 
statistics  which  the  data  provide  can  be  most  readily 
calculated  from  the  correlation  table.  An  example  of 
a  correlation  table  is  shown  in  Table  31,  p.  142. 
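
The counting of observations into the squares of a correlation table may be sketched as follows. This is a minimal illustration in Python, not a procedure taken from the text: the function name `correlation_table`, the class widths, and the five pairs of values are all assumptions introduced for the example.

```python
from collections import Counter
import math

def correlation_table(xs, ys, x_width, y_width):
    """Count the paired observations falling into each square of the table.

    Each square is identified by the pair (x class index, y class index),
    the classes being of equal width for all values of the same variate.
    """
    counts = Counter()
    for x, y in zip(xs, ys):
        square = (math.floor(x / x_width), math.floor(y / y_width))
        counts[square] += 1
    return counts

# Hypothetical pairs (yield in bushels, rainfall in inches), for illustration only.
yields = [28.3, 28.9, 30.1, 30.7, 31.2]
rainfall = [24.2, 24.8, 26.5, 26.1, 27.9]

table = correlation_table(yields, rainfall, x_width=1.0, y_width=1.0)
for square in sorted(table):
    print(square, table[square])
```

With intervals of one bushel and one inch, the first two pairs fall into the same square, so that square carries a count of two.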

10.  Frequency  Diagrams 

When  a  large  number  of  individuals  are  measured 
in  respect  of  physical  dimensions,  weight,  colour, 
density,  etc.,  it  is  possible  to  describe  with  some 
accuracy  the  population  of  which  our  experience  may 
be  regarded  as  a  sample.  By  this  means  it  may  be 
possible  to  distinguish  it  from  other  populations 
differing  in  their  genetic  origin,  or  in  environmental 
circumstances.  Thus  local  races  may  be  very  different 
as  populations,  although  individuals  may  overlap  in 
all  characters  ;  or,  under  experimental  conditions,  the 
aggregate  may  show  environmental  effects,  on  size, 
death-rate,  etc.,  which  cannot  be  detected  in  the 
individual.  A  visible  representation  of  a  large  number 
of  measurements  of  any  one  feature  is  afforded  by  a 
frequency  diagram.  The  feature  measured  is  used 
as  abscissa,  or  measurement  along  the  horizontal  axis, 
and  as  ordinates  are  set  off  vertically  the  frequencies, 
corresponding  to  each  range. 

Fig.  4  is  a  frequency  diagram  illustrating  the 
distribution  in  stature  of  1375  women  (Pearson  and 
Lee's  data  modified).  The  whole  sample  of  women  is 
divided  up  into  successive  height  ranges  of  one  inch. 
Equal areas on the diagram represent equal frequency;
if  the  data  be  such  that  the  ranges  into  which  the 
individuals  are  subdivided  are  not  equal,  care  should 
be  taken  to  make  the  areas  correspond  to  the  observed 
frequencies,  so  that  the  area  standing  upon  any  interval 
of  the  base  line  shall  represent  the  actual  frequency 
observed  in  that  interval. 
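
When the ranges are unequal, the rule that areas shall represent frequencies means that the height of each rectangle is the frequency divided by the width of its interval. A minimal sketch in Python, with hypothetical class limits and frequencies chosen only for illustration:

```python
def histogram_heights(limits, frequencies):
    """Heights of the rectangles such that area = observed frequency.

    `limits` holds the successive class limits (one more than the number
    of classes); `frequencies` holds the observed frequency in each class.
    """
    widths = [upper - lower for lower, upper in zip(limits, limits[1:])]
    return [f / w for f, w in zip(frequencies, widths)]

# Hypothetical data: classes of width 2, 1, 1 and 4 inches.
limits = [60, 62, 63, 64, 68]
frequencies = [10, 12, 15, 20]

heights = histogram_heights(limits, frequencies)
areas = [h * (upper - lower)
         for h, lower, upper in zip(heights, limits, limits[1:])]
print(heights)
print(areas)
```

The widest class has the greatest frequency in this example, yet its rectangle is no taller than that of the first class; plotting the raw frequencies as heights would exaggerate it fourfold.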

The    class    containing    the    greatest    number    of 
observations  is  technically  known  as  the  modal  class. 
In Fig. 4 the modal class indicated is the class whose
central  value  is  63  inches.  When,  as  is  very  frequently 
the  case,  the  variate  varies  continuously,  so  that  all 
intermediate  values  are  possible,  the  choice  of  the 
grouping  interval  and  limits  is  arbitrary  and  will 
make  a  perceptible  difference  to  the  appearance  of  the 
diagram.  Usually,  however,  the  possible  limits  of 
grouping  will  be  governed  by  the  smallest  units  in 
which  the  measurements  are  recorded.  If,  for  example, 
measurements   of  height   were   made   to   the   nearest 
quarter of an inch, so that all values between 66 7/8 inches
and 67 1/8 were recorded as 67 inches, all values between
67 1/8 and 67 3/8 were recorded as 67 1/4, then we have no
choice but to take as our unit of grouping 1, 2, 3, 4, etc.,
quarters  of  an  inch,  and  the  limits  of  each  group  must 
fall  on  some  odd  number  of  eighths  of  an  inch.  For 
purposes  of  calculation  the  smaller  grouping  units  are 
more  accurate,  but  for  diagrammatic  purposes  coarser 
grouping  is  often  preferable.  Fig.  4  indicates  a  unit 
of  grouping  suitable  in  relation  to  the  total  range  for  a 
large  sample  ;  with  smaller  samples  a  coarser  grouping 
is usually necessary in order that sufficient observations
may fall in each class.

In  all  cases  where  the  variation  is  continuous  the 
frequency diagram should be in the form of a histogram,
rectangular areas standing on each grouping
interval  showing  the  frequency  of  observations  in  that 
interval.  The  alternative  practice  of  indicating  the 
frequency  by  a  single  ordinate  raised  from  the  centre 
of  the  interval  is  sometimes  preferred,  as  giving  to  the 
diagram  a  form  more  closely  resembling  a  continuous 
curve.  The  advantage  is  illusory,  for  not  only  is 
the form of the curve thus indicated somewhat misleading,
but the utmost care should always be taken
to distinguish the infinitely large hypothetical population
from which our sample of observations is
drawn, from the actual sample of observations which
we possess; the conception of a continuous frequency
curve is applicable only to the former, and in illustrating
the latter no attempt should be made to slur over
this distinction.

This consideration should in no way prevent a
frequency curve fitted to the data from being superimposed
upon the histogram (as in Fig. 4); the contrast
between the histogram representing the sample,
and the continuous curve representing an estimate of
the form of the hypothetical population, is well brought
out in such diagrams, and the eye is aided in detecting
any serious discrepancy between the observations
and  the  hypothesis.  No  eye  observation  of  such 
diagrams,  however  experienced,  is  really  capable  of 
discriminating  whether  or  not  the  observations  differ 
from  expectation  by  more  than  we  should  expect  from 
the  circumstances  of  random  sampling.  Accurate 
methods  of  making  such  tests  will  be  developed  in 
later  chapters. 

With  discontinuous  variation,  when,  for  example, 
the  variate  is  confined  to  whole  numbers,  the  above 
reason  for  insisting  on  the  histogram  form  has  little 
weight,  for  there  are,  strictly  speaking,  no  ranges  of 
variation  within  each  class.  On  the  other  hand,  there 
is  no  question  of  a  frequency  curve  in  such  cases. 
Representation  of  such  data  by  means  of  a  histogram 
is usual and not inconvenient; it is especially appropriate
if we regard the discontinuous variation as
due  to  an  underlying  continuous  variate,  which  can, 
however,  express  itself  only  to  the  nearest  whole 
number. 

It  is,  of  course,  possible  to  treat  the  values  of  the 
frequency  like  any  other  variable,  by  plotting  the 
value of its logarithm, or its actual value on logarithmic
paper, when it is desired to illustrate the agreement
of the observations with any particular law of
frequency. Fig. 5 shows in this way the number of
flowers (buttercups) having 5 to 10 petals (Pearson's
data), plotted upon logarithmic paper, to facilitate
comparison with the hypothesis that the frequency, for
petals above five, falls off in geometric progression.
Such illustrations are not, properly speaking, frequency
diagrams, although the frequency is one of the variables
employed, because they do not adhere to the
convention that equal frequencies are represented by
equal areas.

Fig. 5.

A useful form, similar to the above, is used to
compare the death-rates, throughout life, of different
populations. The logarithm of the number of survivors
at any age is plotted against the age attained.
Since the death-rate is the rate of decrease of the
logarithm of the number of survivors, equal gradients
on such curves represent equal death-rates. They
therefore serve well to show the increase of death-rate
with increasing age, and to compare populations
with different death-rates. Such diagrams are less
sensitive to small fluctuations than would be the
corresponding frequency diagrams showing the distribution
of the population according to age at death;
they are therefore appropriate when such small
fluctuations are due principally to errors of random
sampling, which in the more sensitive type of diagram
might obscure the larger features of the comparison.
It should always be remembered that the choice of the
appropriate methods of statistical treatment is quite
independent of the choice of methods of diagrammatic
representation.
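
The relation between the gradient of the logarithmic survivorship curve and the death-rate may be illustrated numerically. The following Python sketch is an illustration only; the function name `death_rates` and the two small survivorship series are hypothetical, chosen so that both populations lose one-tenth of their survivors in each ten years.

```python
import math

def death_rates(ages, survivors):
    """Average death-rate over each interval of age.

    The rate is taken as the decrease in the natural logarithm of the
    number of survivors, per unit of age, over the interval.
    """
    rates = []
    for i in range(len(ages) - 1):
        drop = math.log(survivors[i]) - math.log(survivors[i + 1])
        rates.append(drop / (ages[i + 1] - ages[i]))
    return rates

# Hypothetical survivorship figures for two populations of different size.
ages = [0, 10, 20]
population_a = [1000, 900, 810]
population_b = [500, 450, 405]

rates_a = death_rates(ages, population_a)
rates_b = death_rates(ages, population_b)
print(rates_a)
print(rates_b)
```

The two populations differ in size but show the same gradients, and so the same death-rates, which is the property that makes the logarithmic plot convenient for comparison.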

11. The idea of an infinite population distributed
in  a  frequency  distribution  in  respect  of  one  or  more 
characters  is  fundamental  to  all  statistical  work. 
From  a  limited  experience,  for  example,  of  individuals 
of  a  species,  or  of  the  weather  of  a  locality,  we  may 
obtain some idea of the infinite hypothetical population
from which our sample is drawn, and so of the
probable nature of future samples to which our conclusions
are to be applied. If a second sample belies
this  expectation  we  infer  that  it  is,  in  the  language  of 
statistics,  drawn  from  a  different  population  ;  that  the 
treatment  to  which  the  second  sample  of  organisms  had 
been  exposed  did  in  fact  make  a  material  difference, 
or  that  the  climate  (or  the  methods  of  measuring  it) 
had  materially  altered.  Critical  tests  of  this  kind 
may  be  called  tests  of  significance,  and  when  such 
tests  are  available  we  may  discover  whether  a  second 
sample  is  or  is  not  significantly  different  from  the 
first. 

A  statistic  is  a  value  calculated  from  an  observed 
sample  with  a  view  to  characterising  the  population 
from which it is drawn. For example, the mean of a
number of observations $x_1, x_2, \ldots, x_n$, is given by
the equation

$$\bar{x} = \frac{1}{n} S(x),$$

where  S  stands  for  summation  over  the  whole  sample, 
and  n  for  the  number  of  observations.  Such  statistics 
are  of  course  variable  from  sample  to  sample,  and 
the  idea  of  a  frequency  distribution  is  applied  with 
especial  value  to  the  variation  of  such  statistics.  If 
we  know  exactly  how  the  original  population  was 
distributed  it  is  theoretically  possible,  though  often  a 
matter  of  great  mathematical  difficulty,  to  calculate 
how  any  statistic  derived  from  a  sample  of  given  size 
will  be  distributed.  The  utility  of  any  particular 
statistic,  and  the  nature  of  its  distribution,  both 
depend  on  the  original  distribution,  and  appropriate 
and  exact  methods  have  been  worked  out  for  only  a 
few  cases.  The  application  of  these  cases  is  greatly 
extended  by  the  fact  that  the  distribution  of  many 
statistics  tends  to  the  normal  form  as  the  size  of  the 
sample  is  increased.  For  this  reason  it  is  customary 
to  assume  that  such  statistics  are  normally  distributed, 
and to limit consideration of their variability to calculations
of the standard error or probable error.

In  the  present  chapter  we  shall  give  some  account 
of three principal distributions: (i.) the normal distribution,
(ii.) the Poisson series, (iii.) the binomial distribution.
It is important to have a general knowledge
of these three distributions, the mathematical formulae
by which they are represented, the experimental
conditions upon which they occur, and the statistical
methods  of  recognising  their  occurrence.  On  the 
latter  topic  we  shall  be  led  to  some  extent  to  anticipate 
methods  developed  more  systematically  in  Chaps. 
IV.  and  V. 


12.  The  Normal  Distribution 

A variate is said to be normally distributed when
it takes all values from $-\infty$ to $+\infty$, with frequencies
given by a definite mathematical law, namely, that the
logarithm of the frequency at any distance $x$ from
the centre of the distribution is less than the logarithm
of the frequency at the centre by a quantity proportional
to $x^2$. The distribution is therefore symmetrical,
with the greatest frequency at the centre; although
the variation is unlimited, the frequency falls off to
exceedingly small values at any considerable distance
from the centre, since a large negative logarithm
corresponds to a very small number. Fig. 6 B represents
a normal curve of distribution. The frequency
in any infinitesimal range $dx$ may be written as

$$df = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-m)^2}{2\sigma^2}}\, dx,$$

where $x - m$ is the distance of the observation, $x$,
from the centre of the distribution, $m$; and $\sigma$, called
the standard deviation, measures in the same units
the extent to which the individual values are scattered.
Geometrically $\sigma$ is the distance, on either side of the
centre, of the steepest points, or points of inflexion of
the curve (Fig. 4).

Fig. 6. Showing a way in which a symmetrical frequency curve may depart
from the normal distribution. A: flat-topped curve ($\gamma_2$ negative).
B: normal curve ($\gamma_2 = 0$).

In  practical  applications  we  do  not  so  often  want  to 
know  the  frequency  at  any  distance  from  the  centre 
as  the  total  frequency  beyond  that  distance  ;  this  is 
represented  by  the  area  of  the  tail  of  the  curve  cut 
off at any point. Tables of this total frequency,
or probability integral, have been constructed from
which, for any value of $(x - m)/\sigma$, we can find what fraction
of the total population has a larger deviation; or, in
other  words,  what  is  the  probability  that  a  value  so 
distributed,  chosen  at  random,  shall  exceed  a  given 
deviation.  Tables  I.  and  II.  have  been  constructed 
to  show  the  deviations  corresponding  to  different 
values  of  this  probability.  The  rapidity  with  which 
the  probability  falls  off  as  the  deviation  increases  is 
well  shown  in  these  tables.  A  deviation  exceeding 
the  standard  deviation  occurs  about  once  in  three 
trials.  Twice  the  standard  deviation  is  exceeded  only 
about  once  in  22  trials,  thrice  the  standard  deviation 
only  once  in  370  trials,  while  Table  II.  shows  that 
to  exceed  the  standard  deviation  sixfold  would  need 
nearly a thousand million trials. The value for which
P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient
to take this point as a limit in judging whether a
deviation is to be considered significant or not. Deviations
exceeding twice the standard deviation are thus
formally regarded as significant. Using this criterion,
we should be led to follow up a negative result only
once in 22 trials, even if the statistics are the only
guide available. Small effects would still escape
notice if the data were insufficiently numerous to bring
them out, but no lowering of the standard of significance
would meet this difficulty.
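
The frequencies with which given multiples of the standard deviation are exceeded may be checked numerically. The sketch below is in Python and assumes only the standard library class `statistics.NormalDist`; the function name `two_tailed` is introduced here for illustration and does not appear in the text or its tables.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def two_tailed(k):
    """Probability that a normal deviate exceeds k standard deviations
    in either direction."""
    return 2 * (1 - standard_normal.cdf(k))

for k in (1, 2, 3):
    p = two_tailed(k)
    print(f"{k} x S.D.: P = {p:.5f}, about once in {1 / p:.1f} trials")

# Deviation, in units of the standard deviation, for which P = .05.
limit_05 = standard_normal.inv_cdf(1 - 0.05 / 2)
print(round(limit_05, 2))
```

The three probabilities correspond to about once in 3, once in 22, and once in 370 trials, and the deviation for P = .05 comes out close to 1.96.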

Some little confusion is sometimes introduced by
the fact that in some cases we wish to know the probability
that the deviation, known to be positive, shall
exceed an observed value, whereas in other cases the
probability required is that a deviation, which is
equally frequently positive and negative, shall exceed
an observed value; the latter probability is always
half the former. For example, Table I. shows that the
normal deviate falls outside the range ±1.644854 in
10 per cent of cases, and consequently that it exceeds
+1.644854 in 5 per cent of cases.

The  value  of  the  deviation  beyond  which  half  the 
observations  lie  is  called  the  quartile  distance,  and 
bears to the standard deviation the ratio .67449.
It  is  therefore  a  common  practice  to  calculate  the 
standard  error  and  then,  multiplying  it  by  this  factor, 
to  obtain  the  probable  error.  The  probable  error  is 
thus  about  two-thirds  of  the  standard  error,  and  as 
a  test  of  significance  a  deviation  of  three  times  the 
probable error is effectively equivalent to one of twice
the  standard  error.  The  common  use  of  the  probable 
error  is  its  only  recommendation  ;  when  any  critical 
test  is  required  the  deviation  must  be  expressed  in 
terms  of  the  standard  error  in  using  the  probability 
integral  table. 
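
The factor connecting the probable error with the standard error, and the rough equivalence of three probable errors to two standard errors, may be verified as follows. This is a minimal Python sketch using the standard library's `statistics.NormalDist`; the names `QUARTILE_FACTOR` and `probable_error` are introduced for the example only.

```python
from statistics import NormalDist

# Deviation, in units of the standard deviation, beyond which half of a
# normal population lies (one quarter in each tail).
QUARTILE_FACTOR = NormalDist().inv_cdf(0.75)

def probable_error(standard_error):
    """Probable error corresponding to a given standard error."""
    return QUARTILE_FACTOR * standard_error

# Three times the probable error, expressed in standard errors.
three_pe_in_se = 3 * probable_error(1.0)

print(round(QUARTILE_FACTOR, 5))
print(round(three_pe_in_se, 3))
```

The factor agrees with the ratio .67449 quoted above, and three probable errors amount to very slightly more than two standard errors.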

13.  Fitting  the  Normal  Distribution 

From  a  sample  of  n  individuals  of  a  normal 
population  the  mean  and  standard  deviation  of  the 
population  may  be  estimated  by  means  of  two  easily 
calculated statistics. The best estimate of $m$ is $\bar{x}$ where

$$\bar{x} = \frac{1}{n} S(x),$$

while for the best estimate of $\sigma$, we calculate $s$ from

$$s^2 = \frac{1}{n-1} S(x - \bar{x})^2;$$

these  two  statistics  are  calculated  from  the  first  two 
moments  (see  Appendix,  p.  72)  of  the  sample,  and 
are  specially  related  to  the  normal  distribution,  in 
that  they  summarise  the  whole  of  the  information 
which  the  sample  provides  as  to  the  distribution  from 
which  it  was  drawn,  provided  the  latter  was  normal. 
Fitting  by  moments  has  also  been  widely  applied  to 
skew  (asymmetrical)  curves,  and  others  which  are  not 
normal; but such curves have not the peculiar properties
which make the first two moments especially
appropriate,  and  where  the  curves  differ  widely  from 
the  normal  form  the  above  two  statistics  may  be  of 
little  or  no  use. 
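
The two statistics may be computed directly from their definitions. The following Python sketch is illustrative only: the function name `mean_and_variance` and the five observations are hypothetical.

```python
def mean_and_variance(observations):
    """Return the mean and the estimate s^2 of the variance,
    dividing the sum of squared deviations by (n - 1)."""
    n = len(observations)
    mean = sum(observations) / n
    sum_of_squares = sum((x - mean) ** 2 for x in observations)
    return mean, sum_of_squares / (n - 1)

# Hypothetical sample of five statures, in inches.
sample = [66, 68, 69, 71, 71]
mean, s_squared = mean_and_variance(sample)
print(mean, s_squared, s_squared ** 0.5)
```

Here the deviations from the mean are -3, -1, 0, 2 and 2, whose squares sum to 18; dividing by 4 gives the estimate of variance.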

Ex. 2. Fitting a normal distribution to a large
sample. In calculating the statistics from a large
sample it is not necessary to calculate individually
the squares of the deviations from the mean of each
measurement. The measurements are grouped together
in equal intervals of the variate, and the whole
of the calculation may be carried out rapidly as shown
in Table 2, where the distribution of the stature of
1164 men is analysed.

The first column shows the central height in inches
of each group, followed by the corresponding frequencies.
A central group (68.5") is chosen as
"working mean." To form the next column the
frequencies are multiplied by 1, 2, 3, etc., according to
their distance from the working mean; this process
being repeated to form the fourth column, which is
summed from top to bottom in a single operation;
in the third column, however, the upper portion,
representing negative deviations, is summed separately,
and subtracted from the sum of the lower portion.
The difference, in this case positive, shows that the
whole sample of 1164 individuals has in all 167 inches
more than if every individual were 68.5" in height.
This balance divided by 1164 gives the amount
by which the mean of the sample exceeds 68.5".
The mean of the sample is therefore 68.6435". The
sum of the fourth column is also divided by 1164, and
gives an uncorrected estimate of the variance; two
corrections are then applied: one is for the fact that
the working mean differs from the true mean, and
consists in subtracting the square of the difference;
the second, which is Sheppard's correction for grouping,
allows for the fact that the process of grouping
tends somewhat to exaggerate the variance, since in
each group the values with deviations smaller than the
central value will generally be more numerous than
the values with deviations larger than the central
value. Working in units of grouping, this correction
is easily applied by subtracting a constant quantity
1/12 (= .0833) from the variance. From the variance
so corrected the standard deviation is obtained by
taking the square root. This process may be carried
through as an exercise with the distribution of female
statures given in the same table (see also Section 23).

TABLE 2

Central Height   Men            Frequency ×   Frequency ×      Women
(Inches).        (Frequency).   Deviation.    (Deviation)².    (Frequency).

52.5             ..             ..            ..               .5
53.5             ..             ..            ..               .5
54.5             ..             ..            ..               ..
55.5             ..             ..            ..               1
56.5             ..             ..            ..               5
57.5             ..             ..            ..               15
58.5             ..             ..            ..               15.5
59.5             1              −9            81               52
60.5             2.5            −20           160              101
61.5             1.5            −10.5         73.5             150
62.5             9.5            −57           342              199
63.5             31             −155          775              223
64.5             56             −224          896              215
65.5             78.5           −235.5        706.5            169.5
66.5             127            −254          508              151.5
67.5             178.5          −178.5        178.5            81.5
68.5             189            (−1143.5)     ..               40.5
69.5             137            137           137              19.5
70.5             137            274           548              10
71.5             93             279           837              5
72.5             52.5           210           840              ..
73.5             39             195           975              1
74.5             17             102           612              ..
75.5             6.5            45.5          318.5            ..
76.5             3.5            28            224              ..
77.5             1              9             81               ..
78.5             2              20            200              ..
79.5             1              11            121              ..
                                (+1310.5)

Total            1164           +167          8614             1456

Divided by 1164                 +.1435        7.4003
Correction for mean             (.1435)²      −.0206
Correction for grouping                       −.0833
Corrected estimate of variance                7.2964

Working mean       68".5        Estimate of variance   7.2964
Correction         +0".1435     Estimate of S.D.       2".7012
Estimate of mean   68".6435
Standard error     ±".0792      Standard error         ±".0560

Figures in parentheses in the third column are the separate sums of
the negative and positive portions.
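
The arithmetic of Table 2 may be retraced in a few lines. The following Python sketch is offered as a check under stated assumptions, not as the author's own procedure: the variable names are introduced here, and the frequencies are those of the men's column of Table 2 for central heights 59.5" to 79.5".

```python
import math

# Men's frequencies from Table 2, central heights 59.5" to 79.5" in steps of 1".
heights = [59.5 + i for i in range(21)]
men = [1, 2.5, 1.5, 9.5, 31, 56, 78.5, 127, 178.5, 189,
       137, 137, 93, 52.5, 39, 17, 6.5, 3.5, 1, 2, 1]

working_mean = 68.5
n = sum(men)

# Deviations from the working mean, in units of grouping (one inch).
deviations = [h - working_mean for h in heights]
sum_fd = sum(f * d for f, d in zip(men, deviations))
sum_fd2 = sum(f * d * d for f, d in zip(men, deviations))

shift = sum_fd / n                # amount by which the mean exceeds 68.5"
mean = working_mean + shift

variance = sum_fd2 / n            # uncorrected estimate
variance -= shift ** 2            # correction for the working mean
variance -= 1 / 12                # Sheppard's correction for grouping
sd = math.sqrt(variance)

print(n, sum_fd, sum_fd2)
print(round(mean, 4), round(variance, 4), round(sd, 4))
```

The totals 1164, +167 and 8614 are recovered, and the mean, corrected variance and standard deviation agree with the values entered at the foot of the table.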

Any  interval  may  be  used  as  a  unit  of  grouping  ; 
and  the  whole  calculation  is  carried  through  in  such 
units,  the  final  results  being  transformed  into  other 
units  if  required,  just  as  we  might  wish  to  transform 
the mean and standard deviation from inches to centimetres
by multiplying by the appropriate factor. It
is advantageous that the units of grouping should be
exact multiples of the units of measurement; so that
if the above sample had been measured to tenths of an
inch, we might usefully have grouped them at intervals
of 0.6" or 0.7".

Regarded  as  estimates  of  the  mean  and  standard 
deviation  of  a  normal  population  of  which  the  above  is 
regarded  as  a  sample,  the  values  found  are  affected  by 
errors  of  random  sampling  ;  that  is,  we  should  not 
expect  a  second  sample  to  give  us  exactly  the  same 
values.  The  values  for  different  (large)  samples  of 
the  same  size  would,  however,  be  distributed  very 
accurately in normal distributions, so the accuracy
of any one such estimate may be satisfactorily expressed
by its standard error. These standard errors
may be calculated from the standard deviation of the
population, and in treating large samples we take our
estimate of this standard deviation as the basis of the
calculation. The formulae for the standard errors of
random sampling of estimates of the mean and standard
deviation of a large normal sample are (as given in
the Appendix)

$$\frac{\sigma}{\sqrt{n}}, \qquad \frac{\sigma}{\sqrt{2n}},$$

and  their  numerical  values  have  been  appended  to  the 
quantities  to  which  they  refer.  From  these  values  it 
is  seen  that  our  sample  shows  significant  aberration 
from  any  population  whose  mean  lay  outside  the  limits 
68.48"-68.80", and it is therefore likely that the mean
of the population from which it was drawn lay between
these limits; similarly it is likely that its standard
deviation lay between 2.59" and 2.81".
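
The standard errors and the limits quoted may be recovered from the estimates of Table 2. This is a minimal Python sketch; the variable names are ours, and the limits are taken, as an assumption consistent with the criterion of Section 12, at twice the standard error on either side of each estimate.

```python
import math

n = 1164          # size of the sample of men in Table 2
mean = 68.6435    # estimate of the mean, in inches
sd = 2.7012       # estimate of the standard deviation, in inches

se_mean = sd / math.sqrt(n)        # standard error of the mean
se_sd = sd / math.sqrt(2 * n)      # standard error of the standard deviation

# Limits at twice the standard error on either side of each estimate.
mean_limits = (mean - 2 * se_mean, mean + 2 * se_mean)
sd_limits = (sd - 2 * se_sd, sd + 2 * se_sd)

print(round(se_mean, 4), round(se_sd, 4))
print(mean_limits, sd_limits)
```

The standard errors come to about .0792" and .0560", and the limits agree with those given in the text to within the last figure quoted.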

It  may  be  asked,  Is  nothing  lost  by  grouping  ? 
Grouping  in  effect  replaces  the  actual  data  by  fictitious 
data  placed  arbitrarily  at  the  central  values  of  the 
groups  ;  evidently  a  very  coarse  grouping  might  be 
very  misleading.  It  has  been  shown  that  as  regards 
obtaining  estimates  of  the  parameters  of  a  normal 
population,  the  loss  of  information  caused  by  grouping 
is  less  than  1  per  cent,  provided  the  group  interval 
does  not  exceed  one  quarter  of  the  standard  deviation  ; 
the  grouping  of  the  above  sample  in  whole  inches  is 
thus  somewhat  too  coarse  ;  the  loss  in  the  estimation 
of the standard deviation is 2.28 per cent, or about 27
observations out of 1164; the loss in the estimation
of  the  mean  is  half  as  great.  With  suitable  group 
intervals,  however,  little  is  lost  by  grouping,  and  much 
labour  is  saved. 

Another  way  of  regarding  the  loss  of  information 
involved  in  grouping  is  to  consider  how  near  the  values 
obtained  for  the  mean  and  standard  deviation  will  be 
to  the  values  obtained  without  grouping.  From  this 
point  of  view  we  may  calculate  a  standard  error  of 
grouping,  not  to  be  confused  with  the  standard  error 
of  random  sampling  which  measures  the  deviation  of 
the  sample  values  from  the  population  value.  In 
grouping units, the standard error due to grouping of
both the mean and the standard deviation is

$$\frac{1}{\sqrt{12n}},$$

or in this case .0085". For sufficiently fine grouping
this  should  not  exceed  one-tenth  of  the  standard  error 
of  random  sampling. 
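
For the sample of Table 2 the comparison of the two standard errors may be made as follows. A minimal Python sketch, with variable names introduced for the example; the unit of grouping is one inch, so that the result is already in inches.

```python
import math

n = 1164       # size of the sample
sd = 2.7012    # estimate of the standard deviation, in inches

se_grouping = 1 / math.sqrt(12 * n)   # standard error due to grouping
se_sampling = sd / math.sqrt(n)       # standard error of the mean from random sampling

ratio = se_grouping / se_sampling
print(round(se_grouping, 4), round(ratio, 3))
```

The ratio is a little above one-tenth, which accords with the remark that grouping in whole inches is somewhat too coarse for this sample.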

In the above analysis of a large sample the estimate
of the variance employed was

$$\frac{1}{n} S(x - \bar{x})^2,$$

which differs from the formula given previously (p. 46)
in that we have divided by $n$ instead of by $(n - 1)$. In
large samples the difference between these formulae is
small, and that using $n$ may claim a theoretical advantage
if we wish for an estimate to be used in conjunction
with the estimate of the mean from the same sample,
as in fitting a frequency curve to the data; otherwise
it is best to use $(n - 1)$. In small samples the difference
is still small compared to the probable error, but
becomes important if a variance is estimated by
averaging estimates from a number of small samples.
Thus if a series of experiments are carried out each
with six parallels and we have reason to believe that
the variation is in all cases due to the operation of
analogous causes, we may take the average of such
quantities as

$$\frac{1}{n-1} S(x - \bar{x})^2 = \frac{1}{5} S(x - \bar{x})^2$$

to obtain an unbiassed estimate of the variance,
whereas we should under-estimate it were we to divide
by 6.
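
The under-estimation produced by dividing by $n$ may be exhibited exactly in a small artificial case. The Python sketch below is an illustration under assumptions of our own: a hypothetical population in which the values 0 and 1 are equally frequent (variance 1/4), and samples of three, all eight possible samples being equally likely.

```python
from fractions import Fraction
from itertools import product

def sum_of_squares(sample):
    """Sum of squared deviations of a sample from its own mean, S(x - mean)^2."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample)

population = [0, 1]   # hypothetical population, variance 1/4
n = 3

# Every possible sample of three, each equally likely.
samples = list(product(population, repeat=n))

average_dividing_by_n_minus_1 = (
    sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples))
average_dividing_by_n = (
    sum(sum_of_squares(s) / n for s in samples) / len(samples))

print(average_dividing_by_n_minus_1, average_dividing_by_n)
```

Averaged over all samples, the divisor $(n - 1)$ reproduces the population variance of 1/4, while the divisor $n$ gives 1/6, too small in the ratio $(n - 1)/n$.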

14.  Test  of  Departure  from  Normality 

It  is  sometimes  necessary  to  test  whether  an 
observed  sample  does  or  does  not  depart  significantly 
from normality. For this purpose the third, and sometimes
the fourth moment, is calculated; from each of
these it is possible to calculate a quantity, $\gamma$, which
is zero for a normal distribution, and is distributed
normally about zero for large samples; the standard
error being calculable from the size of the sample.
The quantity $\gamma_1$, which is calculated from the third
moment, is essentially a measure of asymmetry; it is
equal to $\pm\sqrt{\beta_1}$ of Pearson's notation; $\gamma_2$ ($= \beta_2 - 3$),
calculated from the fourth moment, measures a
symmetrical type of departure from the normal form,
by which the apex and the two tails of the curve are
increased at the expense of the intermediate portion,
or, when $\gamma_2$ is negative, the top and tails are depleted
and the shoulders filled out, making a relatively
flat-topped curve. (See Fig. 6, p. 43.)

Ex. 3. Use of higher moments to test normality.
Departures from normal form, unless very strongly
marked, can only be detected in large samples; we
give an example (Table 3) of the calculation for
65 values of the yearly rainfall at Rothamsted; the
process of calculation is similar to that of finding the
mean and standard deviation, but it is carried two
stages further, in the calculation of the 3rd and 4th
moments. The formulae by which the two corrections
are applied to the moments are gathered in an
appendix, p. 73.

TABLE 3

Rainfall.   Frequency.   Frequency ×   Frequency ×     Frequency ×     Frequency ×
                         Deviation.    (Deviation)².   (Deviation)³.   (Deviation)⁴.

495         1            −7            49              −343            2401
526         3.5          −21           126             −756            4536
557         2.5          −12.5         62.5            −312.5          1562.5
588         5            −20           80              −320            1280
618         6            −18           54              −162            486
649         7            −14           28              −56             112
680         3.5          −3.5          3.5             −3.5            3.5
711         5.5          ..            ..              ..              ..
742         6            6             6               6               6
773         6            12            24              48              96
804         6.5          19.5          58.5            175.5           526.5
835         1.5          6             24              96              384
866         1            5             25              125             625
897         3            18            108             648             3888
928         3            21            147             1029            7203
959         1            8             64              512             4096
990         1            9             81              729             6561
1021        ..           ..            ..              ..              ..
1051        2            22            242             2662            29282

Total       65           +30.5         1182.5          +4077.5         63048.5

Divided by 65            +.4692        18.1923         62.73           969.98
Correction for mean                    −.2201          −25.40          −93.85
Moments about mean                     17.9722         37.33           876.13
Correction for grouping                −.0833          ..              −8.96
Corrected moments                      17.8889         37.33           867.17

γ₁ = +.493 ± .306        γ₂ = −.290 ± .612

For the moments we obtain

$$\mu_2 = 17.89, \qquad \mu_3 = +37.33, \qquad \mu_4 = 867.17,$$

whence are calculated

$$\gamma_1 = \mu_3 / \mu_2^{3/2} = +.493, \qquad \gamma_2 = \frac{\mu_4}{\mu_2^{2}} - 3 = -.290.$$

For samples from a normal distribution the standard
errors of $\gamma_1$ and $\gamma_2$ are $\sqrt{6/n}$ and $\sqrt{24/n}$, of which the
numerical values are given. It will be seen that $\gamma_1$
exceeds its standard error, but $\gamma_2$ is quite insignificant;
since $\gamma_1$ is positive it appears that there may be some
asymmetry of the distribution in the sense that moderately
dry and very wet years are respectively more
frequent than moderately wet and very dry years.
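
The calculation of $\gamma_1$ and $\gamma_2$ from the grouped frequencies may be retraced as follows. This Python sketch is a check under stated assumptions rather than the author's own working: the frequencies are those of Table 3, with deviations running from −7 to +11 units of grouping about the working mean; the corrections for the mean are the usual formulae for transferring moments to the mean, and the corrections for grouping are Sheppard's, with unit interval. All variable names are introduced for the example.

```python
# Frequencies of Table 3, for deviations -7, -6, ..., +11 from the working mean.
frequencies = [1, 3.5, 2.5, 5, 6, 7, 3.5, 5.5, 6, 6,
               6.5, 1.5, 1, 3, 3, 1, 1, 0, 2]
deviations = range(-7, 12)

n = sum(frequencies)

# Moments about the working mean.
v1, v2, v3, v4 = (
    sum(f * d ** k for f, d in zip(frequencies, deviations)) / n
    for k in (1, 2, 3, 4))

# Correction for the mean: moments about the sample mean.
m2 = v2 - v1 ** 2
m3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
m4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4

# Sheppard's corrections for grouping (unit grouping interval).
mu2 = m2 - 1 / 12
mu3 = m3
mu4 = m4 - m2 / 2 + 7 / 240

gamma1 = mu3 / mu2 ** 1.5
gamma2 = mu4 / mu2 ** 2 - 3

print(n, round(mu2, 4), round(mu3, 2), round(mu4, 2))
print(round(gamma1, 3), round(gamma2, 3))
```

The corrected moments and the two values of $\gamma$ agree with those entered in Table 3 and quoted in the text.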

15.  Discontinuous  Distributions 

Frequently  a  variable  is  not  able  to  take  all  possible 
values,  but  is  confined  to  a  particular  series  of  values, 
such  as  the  whole  numbers.  This  is  obvious  when  the 
variable  is  a  frequency,  obtained  by  counting,  such  as 
the  number  of  cells  on  a  square  of  a  haemacytometer, 
or  the  number  of  colonies  on  a  plate  of  culture  medium. 
The  normal  distribution  is  the  most  important  of  the 
continuous  distributions  ;  but  among  discontinuous 
distributions  the  Poisson  series  is  of  the  first  importance. 
If a number can take the values 0, 1, 2, ..., x, ..., and the
frequencies with which the values occur are given by the series

$$e^{-m}\left(1,\; m,\; \frac{m^2}{2!},\; \ldots,\; \frac{m^x}{x!},\; \ldots\right)$$

(where $x!$ stands for "factorial x" $= x(x-1)(x-2) \ldots 1$), then the
number is distributed in the Poisson series. Whereas the normal curve has
two unknown parameters, m and σ, the Poisson series has only one.
This value may be estimated from a series of observations, by taking
their mean, the mean being a statistic
as  appropriate  to  the  Poisson  series  as  it  is  to  the 
normal  curve.  It  may  be  shown  theoretically  that 
if  the  probability  of  an  event  is  exceedingly  small, 
but  a  sufficiently  large  number  of  independent  cases 
are  taken  to  obtain  a  number  of  occurrences,  then  this 
number  will  be  distributed  in  the  Poisson  series.  For 
example,  the  chance  of  a  man  being  killed  by  horse- 
kick  on  any  one  day  is  exceedingly  small,  but  if  an 
army  corps  of  men  are  exposed  to  this  risk  for  a  year, 
a  certain  number  of  them  will  often  be  killed  in  this 
way.  The  following  data  (Bortkewitch's  data)  were 
obtained  from  the  records  of  ten  army  corps  for  twenty 
years  : 

TABLE 4

 Deaths.    Frequency observed.    Expected.
    0              109               108.67
    1               65                66.29
    2               22                20.22
    3                3                 4.11
    4                1                 0.63
    5               ..                 0.08
    6               ..                 0.01

The average, m, is 0.61, and using this value the numbers calculated
agree excellently with those observed.
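
The expected column of Table 4 can be reproduced as follows. This is a minimal sketch, assuming the 200 corps-years are treated as one sample and the mean is estimated from the observed frequencies; the helper name `poisson_expected` is hypothetical.

```python
import math

def poisson_expected(mean, total, max_count):
    """Expected frequencies total * e^{-m} m^x / x! for x = 0..max_count."""
    return [total * math.exp(-mean) * mean ** x / math.factorial(x)
            for x in range(max_count + 1)]

# Observed frequencies of 0, 1, ..., 6 deaths (Table 4).
observed = [109, 65, 22, 3, 1, 0, 0]
total = sum(observed)                                   # 200 corps-years
mean = sum(x * f for x, f in enumerate(observed)) / total
expected = poisson_expected(mean, total, 6)

for x, (obs, exp) in enumerate(zip(observed, expected)):
    print(x, obs, round(exp, 2))
```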

The importance of the Poisson series in biological research was first
brought out in connexion with the accuracy of counting with a
haemacytometer. It was shown that when the technique of the counting
process was effectively perfect, the number of cells on each square
should be theoretically distributed in a Poisson series; it was further
shown that this distribution was, in favourable circumstances, actually
realised in practice. Thus Table 5 (Student's data) shows the
distribution of yeast cells in the 400 squares into which one square
millimetre was divided.

TABLE 5

 Number of Cells.    Frequency observed.    Frequency expected.
        0                    ..                     3.71
        1                    20                    17.37
        2                    43                    40.65
        3                    53                    63.41
        4                    86                    74.19
        5                    70                    69.44
        6                    54                    54.16
        7                    37                    36.21
        8                    18                    21.18
        9                    10                    11.02
       10                     5                     5.16
       11                     2                     2.19
       12                     2                     0.86
       13                    ..                     0.31
       14                    ..                     0.10
       15                    ..                     0.03
       16                    ..                     0.01
     Total                  400                   400.00

The total number of cells counted is 1872, and the mean number is
therefore 4.68. The expected frequencies calculated from this mean agree
well with those observed. The methods of testing the agreement are
explained in Chapter IV.

When a number is the sum of several components each of which is
independently distributed in a Poisson series, then the total number is
also so distributed. Thus the total count of 1872 cells may be regarded
as a single sample of a series, for which m is not far from 1872. For
such large values of m the distribution of numbers approximates closely
to the normal form, in such a way that the variance is equal to m; we may
therefore attach to the number counted, 1872, the standard error
$\pm\sqrt{1872} = \pm 43.26$, to represent the standard error of random
sampling of such a count. The density of cells in the original suspension
is therefore estimated with a standard error of 2.31 per cent. If, for
instance, a parallel sample differed by 7 per cent, the technique of
sampling would be suspect.
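
The standard error attached to a single large count follows directly from the equality of variance and mean. A short sketch (the function name `count_standard_error` is hypothetical):

```python
import math

def count_standard_error(count):
    """Standard error of a Poisson count, absolute and as a percentage of the count."""
    se = math.sqrt(count)              # variance of a Poisson series equals its mean
    return se, 100.0 * se / count

se, percent = count_standard_error(1872)
print(round(se, 2), round(percent, 2))
```

On this reckoning a difference of 7 per cent between parallel samples is roughly twice the standard error of such a difference (about 2.31 × √2 per cent), which is why it would throw suspicion on the sampling.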


16.  Small  Samples  of  a  Poisson  Series 

Exactly the same principles as govern the accuracy of a haemacytometer
count would also govern a count of bacterial or fungal colonies in
estimating the numbers of those organisms by the dilution method, if it
could be assumed that the technique of dilution afforded a perfectly
random distribution of organisms, and that these could develop on the
plate without mutual interference. Agreement of the observations with the
Poisson distribution thus affords in the dilution method of counting a
test of the suitability of the technique and medium similar to the test
afforded of the technique of haemacytometer counts. The great
practical  difference  between  these  cases  is  that  from 
the  haemacytometer  we  can  obtain  a  record  of  a  large 
number  of  squares  with  only  a  few  organisms  on  each, 
whereas  in  a  bacterial  count  we  may  have  only  5 
parallel  plates,  bearing  perhaps  200  colonies  apiece. 
From  a  single  sample  of  5  it  would  be  impossible  to 
demonstrate  that  the  distribution  followed  the  Poisson 
series  ;  however,  when  a  large  number  of  such  samples 
have  been  obtained  under  comparable  conditions,  it 
is  possible  to  utilise  the  fact  that  for  all  Poisson  series 
the  variance  is  numerically  equal  to  the  mean. 

For each set of parallel plates with $x_1, x_2, \ldots, x_n$ colonies
respectively, taking the mean $\bar{x}$, an index of dispersion may be
calculated by the formula

$$\chi^2 = \frac{S(x - \bar{x})^2}{\bar{x}}.$$

It has been shown that for true samples of a Poisson series, χ²
calculated in this way will be distributed in a known manner; Table III.
(p. 96) shows the principal values of χ² for this distribution; entering
the table take n equal to one less than the number of parallel plates.
For small samples the permissible range of variation of χ² is wide; thus
for five plates with n = 4, χ² will be less than 1.064 in 10 per cent of
cases, while the highest 10 per cent will exceed 7.779; a single sample
of 5 thus gives us little information; but if we have 50 or 100 such
samples, we are in a position to verify with accuracy if the expected
distribution is obtained.
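
The index of dispersion for one set of parallel plates may be sketched as below. The five colony counts are hypothetical, invented purely for illustration, and the function name is an assumption; only the formula and the two limits 1.064 and 7.779 come from the text.

```python
def poisson_dispersion_index(counts):
    """Return (chi2, n) with chi2 = S(x - mean)^2 / mean and n = plates - 1."""
    mean = sum(counts) / len(counts)
    chi2 = sum((x - mean) ** 2 for x in counts) / mean
    return chi2, len(counts) - 1

# Hypothetical colony counts on five parallel plates (not from the source).
plates = [193, 210, 201, 188, 208]
chi2, n = poisson_dispersion_index(plates)

# Limits quoted in the text for n = 4: lowest and highest 10 per cent.
within_central_80_per_cent = 1.064 < chi2 < 7.779
print(round(chi2, 2), n, within_central_80_per_cent)
```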

Ex. 4. Test of agreement with Poisson series of a number of small
samples. From 100 counts of bacteria in sugar refinery products the
following values were obtained (Table 6); there being 6 plates in each
case, the values of χ² were taken from the χ² table for n = 5.

TABLE 6

        χ².             Expected.    Observed.    Expected 43 Per Cent.
  0      to  0.554           1           26             0.43
  0.554  to  0.752           1            6             0.43
  0.752  to  1.145           3           11             1.29
  1.145  to  1.610           5            7             2.15
  1.610  to  2.343          10            7             4.3
  2.343  to  3.000          10            2             4.3
  3.000  to  4.351          20           12             8.6
  4.351  to  6.064          20            7             8.6
  6.064  to  7.289          10            3             4.3
  7.289  to  9.236          10            4             4.3
  9.236  to 11.070           5            1             2.15
 11.070  to 13.388           3            3             1.29
 13.388  to 15.086           1            0             0.43
 over 15.086                 1           11             0.43
 Total                     100          100            43

It is evident that the observed series differs strongly from
expectation; there is an enormous excess in the first class, and in the
high values over 15; the relatively few values from 2 to 15 are not far
from the expected proportions, as is shown in the last column by taking
43 per cent of the expected values. It is possible then that even in this
case nearly half of the samples were satisfactory, but about 10 per cent
were excessively variable, and in about 45 per cent of the cases the
variability was abnormally depressed.

It is often desirable to test if the variability is of the right
magnitude when we have not accumulated a large number of counts, all with
the same number of parallel plates, but where a certain number of counts
are available with various numbers of parallels. In this case we cannot
indeed verify the theoretical distribution with any exactitude, but can
test whether or not the general level of variability conforms with
expectation. The sum of a number of independent values of χ² is itself
distributed in the manner shown in the table of χ², provided we take for
n the number S(n), calculated by adding the several values of n for the
separate experiments. Thus for six sets of 4 plates each the total value
of χ² was found to be 13.85, the corresponding value of n is 6 × 3 = 18,
and the χ² table shows that for n = 18 the value 13.85 is exceeded in
between 70 and 80 per cent of cases; it is therefore
not an abnormal value to obtain. In another case the following values
were obtained:

TABLE 7

 Number of Plates in Set.    Number of Sets.    S(n).    Total χ².
            4                       8             24       27.31
            5                      36            144      133.96
            9                       1              8        8.73
          Total                    45            176      170.00

We have therefore to test if χ² = 170 is an unreasonably small or great
value for n = 176. The χ² table has not been calculated beyond n = 30,
but for higher values we make use of the fact that the distribution of χ
becomes nearly normal. A good approximation is given by assuming that
$\sqrt{2\chi^2} - \sqrt{2n - 1}$ is normally distributed about zero with
unit standard deviation. If this quantity is materially greater than 2,
the value of χ² is not in accordance with expectation. In the example
before us

$$\begin{aligned}
2\chi^2 &= 340, & \sqrt{2\chi^2} &= 18.44 \\
2n - 1 &= 351, & \sqrt{2n - 1} &= 18.73 \\
 & & \text{Difference} &= -0.29
\end{aligned}$$

The set of 45 counts thus shows variability between parallel plates, very
close to that to be expected theoretically. The internal evidence thus
suggests that the technique was satisfactory.
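
The large-n approximation used for Table 7 is easily put into code. A minimal sketch, with a hypothetical function name; working with unrounded square roots gives about -0.30 where the text, differencing the rounded roots, has -0.29.

```python
import math

def chi2_normal_deviate(chi2, n):
    """Approximate normal deviate sqrt(2*chi2) - sqrt(2n - 1) for n beyond the table."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)

# Totals of Table 7: chi2 = 170 with S(n) = 176.
deviate = chi2_normal_deviate(170.0, 176)
print(round(deviate, 2))
```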


17.  Presence  and  Absence  of  Organisms  in  Samples 

When the conditions of sampling justify the use of the Poisson series,
the number of samples containing 0, 1, 2, ... organisms is, as we have
seen, connected by a calculable relation with the mean number of
organisms in the sample. With motile organisms, or in other cases which
do not allow of discrete colony formation, the mean number of organisms
in the sample may be inferred from the proportion of fertile cultures,
provided a single organism is capable of developing. If m is the mean
number of organisms in the sample, the proportion of samples containing
none, that is the proportion of sterile samples, is $e^{-m}$, from which
relation we can calculate, as in the following table, the mean number of
organisms corresponding to 10 per cent, 20 per cent, etc., fertile
samples:

TABLE 8

 Percentage of fertile samples     10      20      30      40      50      60      70      80      90
 Mean number of organisms        0.1054  0.2232  0.3567  0.5108  0.6932  0.9163  1.2040  1.6095  2.3026

In connexion with the use of the above table it is worth noting that for
a given number of samples tested the percentage is most accurately
determined at 50 per cent, but for the minimum percentage error in the
estimate of the number of organisms, nearly 80 per cent fertile or 1.6
organism per sample is most accurate. At this point the standard error of
sampling may be reduced to 10 per cent by taking about 155 samples,
whereas at 50 per cent, to obtain the same accuracy, 208 samples would be
required.
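
The relation behind Table 8, and the numbers of samples quoted, can be sketched as follows. The sample-size formula is an assumption on my part: it treats the fertile proportion as a binomial proportion and carries its sampling variance through m = -log(1 - p) to first order, which reproduces the figures of about 155 and 208 given in the text. Both function names are hypothetical.

```python
import math

def mean_organisms(fertile_fraction):
    """Mean number of organisms per sample when a fraction e^{-m} is sterile."""
    return -math.log(1.0 - fertile_fraction)

def samples_needed(fertile_fraction, relative_se):
    """Samples required for a given relative standard error of the estimated mean.

    First-order approximation (an assumption, not quoted from the text):
    var(m) ~ p / (N * (1 - p)), where p is the fertile fraction.
    """
    p = fertile_fraction
    m = mean_organisms(p)
    return p / ((1.0 - p) * (relative_se * m) ** 2)

m_50 = mean_organisms(0.5)
n_50 = samples_needed(0.5, 0.10)
n_80 = samples_needed(0.8, 0.10)
print(round(m_50, 4), round(n_50, 1), round(n_80, 1))
```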

The Poisson series also enables us to calculate what percentage of the
fertile cultures obtained have been derived from a single organism, for
the percentage of impure cultures, i.e. those derived from 2 or more
organisms, can be calculated from the percentage of cultures which proved
to be fertile. If $e^{-m}$ are sterile, $me^{-m}$ will be pure cultures,
and the remainder impure. The following table gives representative values
of the percentage of cultures which are fertile, and the percentage of
fertile cultures which are impure:

TABLE 9

 Mean number of organisms in sample            0.1     0.2     0.3     0.4     0.5     0.6     0.7
 Percentage fertile                            9.52   18.13   25.92   32.97   39.35   45.12   50.34
 Percentage of fertile cultures impure         4.92    9.67   14.25   18.67   22.92   27.02   30.95

If it is desired that the cultures should be pure with high probability,
a sufficiently low concentration must be used to render at least
nine-tenths of the samples sterile.
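
The two rows of Table 9 follow from the first two terms of the Poisson series. A short sketch, with a hypothetical function name:

```python
import math

def culture_percentages(m):
    """Return (percentage fertile, percentage of fertile cultures that are impure)."""
    sterile = math.exp(-m)             # no organism in the sample
    pure = m * math.exp(-m)            # exactly one organism
    fertile = 1.0 - sterile
    impure = fertile - pure            # two or more organisms
    return 100.0 * fertile, 100.0 * impure / fertile

for m in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    fertile_pct, impure_pct = culture_percentages(m)
    print(m, round(fertile_pct, 2), round(impure_pct, 2))
```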

18.  The  Binomial  Distribution 

The binomial distribution is well known as the first example of a
theoretical distribution to be established. It was found by Bernoulli,
about the beginning of the eighteenth century, that if the probability of
an event occurring were p and the probability of it not occurring were
q (= 1 - p), then if a random sample of n trials were taken, the
frequencies with which the event occurred 0, 1, 2, ..., n times were
given by the expansion of the binomial

$$(q + p)^n.$$

This rule is a particular case of a more general theorem dealing with
cases in which not only a simple alternative is considered, but in which
the event may happen in s ways with probabilities
$p_1, p_2, \ldots, p_s$; then it can be shown that the chance of a random
sample of n giving $a_1$ of the first kind, $a_2$ of the second, ...,
$a_s$ of the last is

$$\frac{n!}{a_1!\,a_2! \cdots a_s!}\; p_1^{a_1} p_2^{a_2} \cdots p_s^{a_s},$$

which is the general term in the multinomial expansion of

$$(p_1 + p_2 + \cdots + p_s)^n.$$

Ex. 5. Binomial distribution given by dice records. In throwing a true
die the chance of scoring more than 4 is 1/3, and if 12 dice are thrown
together the number of dice scoring 5 or 6 should be distributed with
frequencies given by the terms in the expansion of

$$\left(\tfrac{2}{3} + \tfrac{1}{3}\right)^{12}.$$

If, however, one or more of the dice were not true, but if all retained
the same bias throughout the experiment, the frequencies should be given
approximately by

$$(q + p)^{12},$$

where p is a fraction to be determined from the data. The following
frequencies were observed (Weldon's data) in an experiment of 26,306
throws.


TABLE 10

 Number of                                                   Measure of Divergence x²/m
 Dice with    Observed      Expected      Expected
 5 or 6.      Frequency.    True Dice.    Biassed Dice.    True Dice.    Biassed Dice.
     0            185          202.75        187.38           1.554          0.030
     1           1149         1216.50       1146.51           3.745          0.005
     2           3265         3345.37       3215.24           1.931          0.770
     3           5475         5575.61       5464.70           1.815          0.019
     4           6114         6272.56       6269.35           4.008          3.849
     5           5194         5018.05       5114.65           6.169          1.231
     6           3067         2927.20       3042.54           6.677          0.197
     7           1331         1254.51       1329.73           4.664          0.001
     8            403          392.04        423.76           0.306          1.017
     9            105           87.12         96.03           3.670          0.838
    10             14           13.07         14.69           0.066          0.032
    11              4            1.19          1.36 }
    12             ..            0.05          0.06 }         6.143          4.688

  Total         26306        26306.02      26306.00          40.748         12.677
                                                             n = 11         n = 10

It is apparent that the observations are not compatible with the
assumption that the dice were unbiassed. With true dice we should expect
more cases than have been observed of 0, 1, 2, 3, 4, and fewer cases than
have been observed of 5, 6, ..., 11 dice scoring more than four. The same
conclusion is more clearly brought out in the fifth column, which shows
the values of the measure of divergence

$$\frac{x^2}{m},$$

where m is the expected value and x the difference between the expected
and observed values. The aggregate of these values is χ², which measures
the deviation of the whole series from the expected series of
frequencies, and the actual chance of χ² exceeding 40.75, the value for
the hypothesis that the dice are true, is 0.00003.
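
The expectations for true dice and the aggregate χ² can be recomputed from the observed frequencies. This is a sketch under the assumption, apparent from Table 10, that the last two classes (11 and 12 dice) are pooled before the divergence is calculated; variable names are hypothetical. Carrying full precision gives a total very slightly above the tabulated 40.748, which was worked from rounded expectations.

```python
from math import comb

# Observed frequencies of 0..12 dice showing 5 or 6 (Weldon's data, Table 10).
observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
throws = sum(observed)                                  # 26306

# Expected frequencies from the expansion of (2/3 + 1/3)^12.
expected = [throws * comb(12, k) * (1 / 3) ** k * (2 / 3) ** (12 - k)
            for k in range(13)]

# Pool the two smallest classes, as in the table.
obs_pooled = observed[:11] + [sum(observed[11:])]
exp_pooled = expected[:11] + [sum(expected[11:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
print(round(expected[0], 2), round(chi2, 2))
```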

The total number of times in which a die showed 5 or 6 was 106,602, out
of 315,672 trials, whereas the expected number with true dice is 105,224;
from the former number, the value of p can be calculated, and proves to
be 0.3376986, and hence the expectations of the fourth column were
obtained. These values are much more close to the observed series, and
indeed fit them satisfactorily, showing that the conditions of the
experiment were really such as to give a binomial series.

The standard deviation of the binomial series is $\sqrt{pqn}$. Thus with
true dice and 315,672 trials the expected number of dice scoring more
than 4 is 105,224 with standard error 264.9; the observed number exceeds
expectation by 1378, or 5.20 times its standard error; this is the most
sensitive test of the bias, and it may be legitimately applied, since for
such large samples the binomial distribution closely approaches the
normal. From the table of the probability integral it appears that a
normal deviation only exceeds 5.2 times its standard error once in 5
million times.
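
The test of the bias on the total number of successes can be sketched in a few lines. Names are hypothetical; the tail probability is taken from the normal distribution by way of `math.erfc`, counting deviations in either direction, which is my reading of "once in 5 million times".

```python
import math

successes = 106602            # dice showing 5 or 6
trials = 315672               # 26306 throws of 12 dice
p_true = 1.0 / 3.0

p_hat = successes / trials                              # estimated chance with the biassed dice
expected = trials * p_true                              # 105224
se = math.sqrt(trials * p_true * (1.0 - p_true))        # sqrt(pqn)
z = (successes - expected) / se

# Chance of a normal deviation exceeding |z| in either direction.
two_sided_tail = math.erfc(z / math.sqrt(2.0))
print(round(p_hat, 7), round(se, 1), round(z, 2), two_sided_tail)
```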

The reason why this last test gives so much higher odds than the test for
goodness of fit, is that the latter is testing for discrepancies of any
kind, such, for example, as copying errors would introduce. The actual
discrepancy is almost wholly due to a single item, namely, the value of
p, and when that point is tested separately its significance is more
clearly brought out.

Ex. 6. Comparison of sex ratio in human families with binomial
distribution. Biological data are rarely so extensive as this experiment
with dice; Geissler's data on the sex ratio in German families will serve
as an example. It is well known that male births are slightly more
numerous than female births, so that if a family of 8 is regarded as a
random sample of eight from the general population, the number of boys in
such families should be distributed in the binomial

$$(q + p)^8,$$

where p is the proportion of boys. If, however, families differ not only
by chance, but by a tendency on the part of some parents to produce males
or females, then the distribution of the number of boys should show an
excess of unequally divided families, and a deficiency of equally or
nearly equally divided families. The data in Table 11 show that there is
evidently such an excess of very unequally divided families.

TABLE 11

 Number of    Number of Families
 Boys.        Observed.             Expected.     Excess (x).      x²/m
     0             215                165.22        +  49.78      15.604
     1            1485               1401.69        +  83.31       4.952
     2            5331               5202.65        + 128.35       3.166
     3           10649              11034.65        - 385.65      13.478
     4           14959              14627.60        + 331.40       7.508
     5           11929              12409.87        - 480.87      18.633
     6            6678               6580.24        +  97.76       1.452
     7            2092               1993.78        +  98.22       4.839
     8             342                264.30        +  77.70      22.843

  Total          53680              53680.00                      92.475

The observed series differs from expectation markedly in two respects:
one is the excess of unequally divided families; the other is the
irregularity of the central values, showing an apparent bias in favour of
even values. No biological reason is suggested for the latter
discrepancy, which therefore detracts from the value of the data. The
excess of the extreme types of family may be treated in more detail by
comparing the observed with the expected variance. The expected variance,
npq, is 1.99828, while that calculated from the data is 2.06742, showing
an excess of 0.06914, or 3.46 per cent. The standard error of the
variance is

$$\sqrt{\frac{\mu_4 - \mu_2^2}{N}},$$

where N is the number of samples, and $\mu_2$ and $\mu_4$ are the second
and fourth moments of the theoretical distribution, namely,

$$\mu_2 = npq,$$
$$\mu_4 = 3n^2p^2q^2 + npq(1 - 6pq),$$

so that

$$\mu_4 - \mu_2^2 = 2n^2p^2q^2 + npq(1 - 6pq).$$

The approximate values of these two terms are 8 and -1, giving +7, the
actual value being 6.98966. Hence the standard error of the variance is
0.01141; the discrepancy is over six times its standard error.
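
The standard error of the variance may be verified numerically. A minimal sketch: the expected and observed variances are taken as quoted in the text, and pq is recovered from npq with n = 8; the variable names are hypothetical.

```python
import math

n = 8                         # children per family
families = 53680              # N, the number of samples
npq = 1.99828                 # expected variance, as quoted
observed_variance = 2.06742   # variance calculated from the data, as quoted

pq = npq / n
term_1 = 2.0 * npq ** 2                  # 2 n^2 p^2 q^2, roughly 8
term_2 = npq * (1.0 - 6.0 * pq)          # npq(1 - 6pq), roughly -1
mu4_minus_mu2_squared = term_1 + term_2

se_variance = math.sqrt(mu4_minus_mu2_squared / families)
ratio = (observed_variance - npq) / se_variance
print(round(mu4_minus_mu2_squared, 4), round(se_variance, 5), round(ratio, 2))
```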

One possible cause of the excessive variation lies in the occurrence of
multiple births, for it is known that children of the same birth tend to
be of the same sex. The multiple births are not separated in these data,
but an idea of the magnitude of this effect may be obtained from other
data for the German Empire. These show about 12 twin births per thousand,
of which 5/8 are of like sex and 3/8 of unlike, so that one-quarter of
the twin births, 3 per thousand, may be regarded as "identical" in
respect of sex. Six children per thousand would therefore probably belong
to such "identical" twin births, the additional effect of triplets, etc.,
being small. Now with a population of identical twins it is easy to see
that the theoretical variance is doubled; consequently, to raise the
variance by 3.46 per cent we require that 3.46 per cent of the children
should be "identical" twins; this is more than five times the general
average, and although it is probable that the proportion of twins is
higher in families of 8 than in the general population, we cannot
reasonably ascribe more than a fraction of the excess variance to
multiple births.


19.  Small  Samples  of  the  Binomial  Series 

With  small  samples,  such  as  ordinarily  occur  in 
experimental  work,  agreement  with  the  binomial 
series  cannot  be  tested  with  much  precision  from  a 
single  sample.  It  is,  however,  possible  to  verify  that 
the  variation  is  approximately  what  it  should  be, 
by  calculating  an  index  of  dispersion  similar  to  that 
used  for  the  Poisson  series. 

Ex. 7. The accuracy of estimates of infestation. The proportion of barley
ears infected with gout-fly may be ascertained by examining 100 ears, and
counting the infected specimens; if this is done repeatedly, the numbers
obtained, if the material is homogeneous, should be distributed in the
binomial

$$(q + p)^{100},$$

where p is the proportion infested, and q the proportion free from
infestation. The following are the data from 10 such observations made on
the same plot (J. G. H. Frew's data):

16, 18, 11, 18, 21, 10, 20, 18, 17, 21.         Mean 17.0.

Is the variability of these numbers ascribable to random sampling; i.e.
is the material apparently homogeneous? Such data differ from those to
which the Poisson series is appropriate, in that a fixed total of 100 is
in each case divided into two classes, infected and not infected, so that
in taking the variability of the infected series we are equally testing
the variability of the series of numbers not infected. The modified form
of χ², the index of dispersion, appropriate to the binomial is

$$\chi^2 = \frac{S(x - \bar{x})^2}{n\bar{p}\bar{q}} = \frac{S(x - \bar{x})^2}{\bar{x}\bar{q}},$$

differing from the form appropriate to the Poisson series in containing
the divisor q, or in this case, 0.83. The value of χ² is 9.22, which, as
the χ² table shows, is a perfectly reasonable value for n = 9, one less
than the number of values available.
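
The binomial index of dispersion for Frew's ten counts can be sketched as follows (the function name is hypothetical). Worked at full precision the index comes out at about 9.2, in line with the value quoted in the text.

```python
def binomial_dispersion_index(counts, sample_size):
    """Return (chi2, n) with chi2 = S(x - mean)^2 / (mean * q) and n = len(counts) - 1."""
    mean = sum(counts) / len(counts)
    q = 1.0 - mean / sample_size       # estimated proportion free from infestation
    chi2 = sum((x - mean) ** 2 for x in counts) / (mean * q)
    return chi2, len(counts) - 1

# Infected ears per 100 examined, ten observations on one plot (Frew's data).
infected = [16, 18, 11, 18, 21, 10, 20, 18, 17, 21]
chi2, n = binomial_dispersion_index(infected, 100)
print(round(chi2, 2), n)
```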

Such a test of the single sample, is, of course, far from conclusive,
since χ² may vary within wide limits. If, however, a number of such small
samples are available, though drawn from plots of very different
infestation, we can test, as with the Poisson series, if the general
trend of variability accords with the binomial distribution. Thus from 20
such plots the total χ² is 193.64, while S(n) is 180. Testing as before
(p. 61), we find

$$\begin{aligned}
\sqrt{387.28} &= 19.68 \\
\sqrt{359} &= 18.95 \\
\text{Difference} &= +0.73
\end{aligned}$$

The difference being less than one, we conclude that the variance shows
no sign of departure from that of the binomial distribution. The
difference between the method appropriate for this case, in which the
samples are small (10), but each value is derived from a considerable
number (100) of observations, and that appropriate for the sex
distribution in families of 8, where we had many families, each of only 8
observations, lies in the omission of the term

$$npq(1 - 6pq)$$

in calculating the standard error of the variance.
When n is 100 this term is very small compared to $2n^2p^2q^2$, and in
general the χ² method is highly accurate if the number in all the
observational categories is as high as 10.

Appendix of Technical Notation and Formulae

A. Definition of moments of sample.

The following statistics are known as the first four moments of the
variate x; the first moment is the mean

$$\mu_1 = \bar{x} = \frac{1}{n} S(x);$$

the second and higher moments are the mean values of the second and
higher powers of the deviations from the mean

$$\mu_2 = \frac{1}{n} S(x - \bar{x})^2, \qquad \mu_3 = \frac{1}{n} S(x - \bar{x})^3, \qquad \mu_4 = \frac{1}{n} S(x - \bar{x})^4.$$


B. Moments of theoretical distribution in terms of parameters.

                   Symbol.                   Normal.    Poisson.    Binomial.
 Mean              μ1                          m           m         np
 Variance          μ2                          σ²          m         npq
                   μ3                          0           m         -npq(p - q)
                   μ4 - 3μ2²                   0           m         npq(1 - 6pq)
                   γ1 = μ3/μ2^(3/2)            0         1/√m        -(p - q)/√(npq)
                   γ2 = (μ4 - 3μ2²)/μ2²        0          1/m        (1 - 6pq)/npq

C. Variance of moments derived from samples of N.

                          General Form.       Normal.    Poisson.       Binomial.
 Variance of mean         μ2/N                σ²/N       m/N            npq/N
 Variance of variance     (μ4 - μ2²)/N        2σ⁴/N      m(2m + 1)/N    {2n²p²q² + npq(1 - 6pq)}/N
 Variance of standard
   deviation              μ2(2 + γ2)/4N       σ²/2N      (2m + 1)/4N    {2npq + 1 - 6pq}/4N
 Variance of γ1                               6/N
 Variance of γ2                               24/N

D. Corrections in calculating moments.

(a) Correction for mean, if v' is the moment about the working mean, and
v the corresponding value corrected to the true mean:

$$v_2 = v'_2 - v_1'^{\,2},$$
$$v_3 = v'_3 - 3v'_1 v'_2 + 2v_1'^{\,3},$$
$$v_4 = v'_4 - 4v'_1 v'_3 + 6v_1'^{\,2} v'_2 - 3v_1'^{\,4}.$$

(b) Correction for grouping, if v is the estimate uncorrected for
grouping, and μ the corresponding estimate corrected, the unit of
grouping being taken as the unit of measurement:

$$\mu_2 = v_2 - \tfrac{1}{12},$$
$$\mu_3 = v_3,$$
$$\mu_4 = v_4 - \tfrac{1}{2} v_2 + \tfrac{7}{240}.$$

IV

TESTS OF GOODNESS OF FIT, INDEPENDENCE
AND HOMOGENEITY; WITH TABLE OF χ²

20. The χ² Distribution

In the last chapter some use has been made of the χ² distribution as a
means of testing the agreement between observation and hypothesis; in the
present chapter we shall deal more generally with the very wide class of
problems which may be solved by means of the same distribution.

The element common to all such tests is the comparison of the numbers
actually observed to fall into any number of classes with the numbers
which upon some hypothesis are expected. If m is the number expected, and
m + x the number observed, in any class, we calculate

$$\chi^2 = S\left(\frac{x^2}{m}\right),$$

the summation extending over all the classes. This formula gives the
value of χ², and it is clear that the more closely the observed numbers
agree with those expected the smaller will χ² be; in order to utilise the
table it is necessary to know also the value of n with which the table is
to be entered. The rule for finding n is that n is equal to the number of
degrees of freedom in which the observed series may differ from the
hypothetical; in other words, it is equal to the number of classes the
frequencies in which may be filled up arbitrarily. Several examples will
be given to illustrate this rule.

For any value of n, which must be a whole number, the form of
distribution of χ² was established by Pearson in 1900; it is therefore
possible to calculate in what proportion of cases any value of χ² will be
exceeded. This proportion is represented by P, which is therefore the
probability that χ² shall exceed any specified value. To every value of
χ² there thus corresponds a certain value of P; as χ² is increased from 0
to infinity, P diminishes from 1 to 0. Equally, to any value of P in this
range there corresponds a certain value of χ². Algebraically the relation
between these two quantities is a complex one, so that it is necessary to
have a table of corresponding values, if the χ² test is to be available
for practical use.

An important table of this sort was prepared by Elderton, and is known as
Elderton's Table of Goodness of Fit. Elderton gives the values of P to
six decimal places corresponding to each integral value of χ² from 1 to
30, and thence by tens to 70. In place of n, the quantity n' (= n + 1)
was used, since it was then believed that this could be equated to the
number of frequency classes. Values of n' from 3 to 30 were given, these
corresponding to values of n from 2 to 29. A table for n' = 2, or n = 1,
was subsequently supplied by Yule. Owing to copyright restrictions we
have not reprinted Elderton's table, but have given a new table
(Table III. p. 96) in a form which experience has shown to be more
convenient. Instead of giving the values of P corresponding to an
arbitrary series of values of χ², we have given the values of χ²
corresponding to specially selected values of P. We have thus been able
in a compact form to cover those parts of the distributions which have
hitherto not been available, namely, the values of χ² less than unity,
which frequently occur for small values of n, and the values exceeding
30, which for larger values of n become of importance.

It is of interest to note that the measure of dispersion,
Q², introduced by the German economist, Lexis,
is, if accurately calculated, equivalent to χ²/n of our
notation. In the many references in English to the
method of Lexis, it has not, I believe, been noted that
the discovery of the distribution of χ² in reality completed
the method of Lexis. If it were desired to use
Lexis' notation, our table could be transformed into a
table of Q² merely by dividing each entry by n.

In preparing this table we have borne in mind that
in practice we do not want to know the exact value of
P for any observed χ², but, in the first place, whether
or not the observed value is open to suspicion. If P
is between .1 and .9 there is certainly no reason to
suspect the hypothesis tested. If it is below .02 it is
strongly indicated that the hypothesis fails to account
for the whole of the facts. We shall not often be astray
if we draw a conventional line at .05, and consider that
higher values of χ² indicate a real discrepancy.
To compare values of χ², or of P, by means of a
"probable error" is merely to substitute an inexact
(normal) distribution for the exact distribution given
by the χ² table.

The term Goodness of Fit has caused some to fall
into the fallacy of believing that the higher the value
of P the more satisfactorily is the hypothesis verified.
Values over .999 have sometimes been reported which,
if the hypothesis were true, would only occur once
in a thousand trials. Generally such cases are demonstrably
due to the use of inaccurate formulae, but
occasionally small values of χ² beyond the expected
range do occur, as in Ex. 4 with the colony numbers
obtained in the plating method of bacterial counting.
In these cases the hypothesis considered is as definitely
disproved as if P had been .001.

When a large number of values of χ² are available
for testing, it may be possible to reveal discrepancies
which are too small to show up in a single value; we
may then compare the observed distribution of χ² with
that expected. This may be done immediately by
simply distributing the observed values of χ² among
the classes bounded by values given in the χ² table,
as in Ex. 4, p. 59. The expected frequencies in
these classes are easily written down, and, if necessary,
the χ² test may be used to test the agreement of the
observed with the expected frequencies.

It is useful to remember that the sum of any number
of quantities, χ², is distributed in the χ² distribution,
with n equal to the sum of the values of n corresponding
to the values of χ² used. Such a test is sensitive,
and will often bring to light discrepancies which are
hidden or appear obscurely in the separate values.

The table we give has values of n up to 30;
beyond this point it will be found sufficient to assume
that √(2χ²) is distributed normally with unit standard
deviation about a mean √(2n − 1). The values of P
obtained by applying this rule to the values of χ² given
for n = 30, may be worked out as an exercise. The
errors are small for n = 30, and become progressively
smaller for higher values of n.
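
The exercise suggested above can be sketched in a few lines of Python. This is an illustration only: the helper names `chi2_sf_even` and `approx_p` are assumptions of the sketch, the exact P is obtained from the finite series valid for even n, and the trial values of χ² are arbitrary.

```python
import math

def chi2_sf_even(x, n):
    """Exact P for even n: exp(-x/2) * sum_{k < n/2} (x/2)^k / k!."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, n // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def approx_p(x, n):
    """P from treating sqrt(2*chi2) as normal about sqrt(2n - 1) with unit s.d."""
    z = math.sqrt(2.0 * x) - math.sqrt(2.0 * n - 1.0)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the normal curve

n = 30
for x in (20.0, 30.0, 40.0, 50.0):               # illustrative values of chi-square
    print(x, round(chi2_sf_even(x, n), 4), round(approx_p(x, n), 4))
```

Printing the two columns side by side shows how closely the normal rule follows the exact distribution at n = 30.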

Ex. 8. Comparison with expectation of Mendelian
class frequencies. In a cross involving two Mendelian
factors we expect by interbreeding the hybrid (F1)
generation to obtain four classes in the ratio 9:3:3:1;
the hypothesis in this case is that the two factors
segregate independently, and that the four classes
of offspring are equally viable. Are the following
observations on Primula (de Winton and Bateson) in
accordance with this hypothesis?

TABLE 12

                       Flat Leaves.              Crimped Leaves.
                   Normal Eye.  Primrose      Lee's Eye.  Primrose      Total.
                                Queen Eye.                Queen Eye.

Observed (m + x)      328          122            77          33          560
Expected (m)          315          105           105          35          560
x²/m                  .537        2.752         7.467        .114       10.870

The expected values are calculated from the
observed total, so that the four classes must agree in
their sum, and if three classes are filled in arbitrarily
the fourth is therefore determinate; hence n = 3;
χ² = 10.87, the chance of exceeding which value is
between .01 and .02; if we take P = .05 as the limit
of significant deviation, we shall say that in this
case the deviations from expectation are clearly
significant.
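
The arithmetic of Table 12 can be reproduced as follows. This is a sketch in Python under the assumptions stated in the example (9:3:3:1 expectation, only the total fitted); the variable names are illustrative, and the closed form used for P is the standard one for n = 3.

```python
import math

observed = [328, 122, 77, 33]            # Table 12, in column order
ratio = [9, 3, 3, 1]                     # Mendelian expectation
total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]

# Each class contributes x^2 / m, where x is the deviation from expectation m.
chi2 = sum((o - m) ** 2 / m for o, m in zip(observed, expected))
n = len(observed) - 1                    # only the total is fitted, so n = 3

# Closed form of P for n = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
print(expected, round(chi2, 3), n, round(p, 4))
```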

Let us consider a second hypothesis in relation
to the same data, differing from the first in that we
suppose that the plants with crimped leaves are to
some extent less viable than those with flat leaves.
Such a hypothesis could of course be tested by means
of additional data; we are only here concerned with
the question whether or no it accords with the values
before us. The hypothesis tells us nothing of what
degree of relative viability to expect; we therefore take
the totals of flat and crimped leaves observed, and
divide each class in the ratio 3:1.

TABLE 13

                  Flat Leaves.              Crimped Leaves.
              Normal Eye.  Primrose      Lee's Eye.  Primrose        χ².
                           Queen Eye.                Queen Eye.

Observed         328          122            77          33
Expected         337.5        112.5          82.5        27.5
x²/m             .267         .804           .367       1.109        2.547

The value of n is now 2, since only two entries can
be made arbitrarily; the value of χ², however, is so
much reduced that P exceeds .2, and the departure
from expectation is no longer significant. The significant
part of the original discrepancy lay in the
proportion of flat to crimped leaves.
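
A short Python sketch of the second hypothesis follows; the dictionary layout and variable names are illustrative assumptions. Evaluated directly from the frequencies shown, the sum comes to about 2.54, marginally below the printed total, and P still exceeds .2.

```python
import math

# Table 13: observed numbers within each leaf type (3:1 class first).
observed = {"flat": [328, 122], "crimped": [77, 33]}

chi2 = 0.0
for counts in observed.values():
    subtotal = sum(counts)
    # Each leaf type keeps its observed total and is divided in the ratio 3:1.
    expected = [subtotal * 3 / 4, subtotal * 1 / 4]
    chi2 += sum((o - m) ** 2 / m for o, m in zip(counts, expected))

n = 2                                    # only two entries may be filled arbitrarily
p = math.exp(-chi2 / 2)                  # exact P when n = 2
print(round(chi2, 3), round(p, 3))
```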

It was formerly believed that in entering the χ²
table n was always to be equated to one less than the
number of frequency classes; this view led to many
discrepancies, and has since been disproved with the
establishment of the rule stated above. On the old
view, any complication of the hypothesis such as that
which in the above instance admitted differential
viability, was bound to give an apparent improvement
in the agreement between observation and hypothesis.
When the change in n is allowed for, this bias disappears,
and if the value of P, rightly calculated, is
many fold increased, as in this instance, the increase
may safely be ascribed to an improvement in the
hypothesis, and not to a mere increase of available
constants.

Ex. 9. Comparison with expectation of the Poisson
series and Binomial series. In Table 5, p. 56, we
give the observed and expected frequencies in the case
of a Poisson series. In applying the χ² test to such
a series it is desirable that the number expected should
in no group be less than 5, since the calculated distribution
of χ² is not very closely realised for very small
classes. We therefore pool the numbers for 0 and 1
cells, and also those for 10 and more, and obtain the
following comparison:

TABLE 14

           0 and 1    2      3      4      5      6      7      8      9    10 and more   Total
Observed     20      43     53     86     70     54     37     18     10         9         400
Expected    21.08   40.65  63.41  74.19  69.44  54.16  36.21  21.18  11.02      8.66       400
x²/m         .055    .136  1.709  1.880   .005   .005   .017   .477   .093      .013      4.390

Using 10 frequency classes we have χ² = 4.390; in
ascertaining the value of n we have to remember that
the expected frequencies have been calculated, not
only from the total number of values observed (400),
but also from the observed mean; there remain, therefore,
8 degrees of freedom and n = 8. For this value
the χ² table shows that P is between .8 and .9, showing
a close, but not an unreasonably close, agreement with
expectation.
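
The counting of degrees of freedom in this example can be made explicit in code. The sketch below, in Python, takes the pooled frequencies of Table 14 as given; the helper `chi2_sf_even` and the variable `fitted` are illustrative names, and the value of χ² recomputed from the rounded expectations may differ from the printed total in the third decimal place.

```python
import math

# Table 14: classes "0 and 1", 2, 3, ..., 9, "10 and more".
observed = [20, 43, 53, 86, 70, 54, 37, 18, 10, 9]
expected = [21.08, 40.65, 63.41, 74.19, 69.44, 54.16, 36.21, 21.18, 11.02, 8.66]

chi2 = sum((o - m) ** 2 / m for o, m in zip(observed, expected))

fitted = 2                               # the total and the mean were taken from the data
n = len(observed) - fitted               # 10 classes - 2 fitted quantities = 8

def chi2_sf_even(x, n):
    """Exact P for even n: exp(-x/2) * sum_{k < n/2} (x/2)^k / k!."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, n // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p = chi2_sf_even(chi2, n)
print(round(chi2, 3), n, round(p, 3))
```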

Similarly in Table 10, p. 65, we have given the
value of χ² based upon 12 classes for the two hypotheses
of "true dice" and "biassed dice"; with
"true dice" the expected values are calculated from
the total number of observations alone, and n = 11, but
in allowing for bias we have brought also the means
into agreement so that n is reduced to 10. In the first
case χ² is far outside the range of the table, showing a
highly significant departure from expectation; in the
second it appears that P lies between .2 and .3, so that
the value of χ² is within the expected range.


21.  Tests  of  Independence,  Contingency  Tables 

A special and important class of cases where the
agreement between expectation and observation may
be tested comprises the tests of independence. If the
same group of individuals is classified in two (or
more) different ways, as persons may be classified as
inoculated and not inoculated, and also as attacked
and not attacked by a disease, then we may require to
know if the two classifications are independent.
In the simplest case, when each classification
comprises only two classes, we have a 2 × 2 table, or,
as it is often called, a fourfold table.

Ex. 10. The following table is taken from Greenwood
and Yule's data for Typhoid:

TABLE 15
Observed

                    Attacked.    Not Attacked.     Total.
Inoculated              56           6,759          6,815
Not Inoculated         272          11,396         11,668
Total                  328          18,155         18,483

TABLE 16
Expected

                    Attacked.    Not Attacked.     Total.
Inoculated            120.93        6,694.07        6,815
Not Inoculated        207.07       11,460.93       11,668
Total                 328          18,155          18,483

In testing independence we must compare the
observed values with values calculated so that the four
frequencies are in proportion; since we wish to test
independence only, and not any hypothesis as to the
total numbers attacked, or inoculated, the "expected"
values are calculated from the marginal totals observed,
so that the numbers expected agree with the numbers
observed in the margins; only one value need be
calculated, e.g.

$$\frac{328 \times 6815}{18483} = 120.93;$$

the others are written down at once by subtraction
from the margins. It is thus obvious that the observed
values can differ from those expected in only 1 degree
of freedom, so that in testing independence in a fourfold
table, n = 1. Since χ² = 56.234 the observations
are clearly opposed to the hypothesis of independence.
Without calculating the expected values, χ² may, for
fourfold tables, be directly calculated by the formula

$$\chi^2 = \frac{(ad - bc)^2 (a + b + c + d)}{(a + b)(c + d)(a + c)(b + d)},$$

where a, b, c, and d are the four observed numbers.
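
Both routes to χ² for the fourfold table, by way of the expected values and by the direct formula, can be compared in a few lines. The Python sketch below uses the numbers of Table 15; the variable names are illustrative.

```python
# Table 15: a, b = inoculated (attacked, not attacked); c, d = not inoculated.
a, b = 56, 6759
c, d = 272, 11396
N = a + b + c + d

# Only one expected value need be computed; the rest follow from the margins.
expected_a = (a + b) * (a + c) / N
expected = [
    [expected_a, (a + b) - expected_a],
    [(a + c) - expected_a, (b + d) - ((a + b) - expected_a)],
]
observed = [[a, b], [c, d]]

chi2_cells = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)
chi2_formula = (a * d - b * c) ** 2 * N / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(expected_a, 2), round(chi2_cells, 3), round(chi2_formula, 3))
```

The two values agree, as they must, since the formula is only an algebraic rearrangement of the sum over the four cells.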

When only one of the classifications is of two
classes, the calculation of χ² may be simplified to some
extent, if it is not desired to calculate the expected
numbers. If a, a′ represent any pair of observed
frequencies, and n, n′ the corresponding totals, we
calculate from each pair

$$\frac{(an' - a'n)^2}{a + a'},$$

and the sum of these quantities divided by nn′ will
be χ².

Ex. 11. Test of independence in a 2 × n classification.
From the pigmentation survey of Scottish
children (Tocher's data) the following are the numbers
of boys and girls from the same district (No. 1) whose
hair colour falls into each of five classes:

TABLE 17
Hair Colour

            Fair.     Red.    Medium.    Dark.    Jet Black.    Total.
Boys         592      119       849       504         36         2100
Girls        544       97       677       451         14         1783
Total       1136      216      1526       955         50         3883
            6.642     .333     5.555     2.460      24.204      39.194

The quantities calculated from each pair of
observations are given in the lowest line of the table,
in millions. Thus

$$\frac{(544 \times 2100 - 592 \times 1783)^2}{1136} = 6{,}642{,}000$$

approximately; the total of 39 millions odd divided by
2100 and by 1783 gives χ² = 10.468. In this table
4 values could be filled in arbitrarily without conflicting
with the marginal totals, so that n = 4. The
value of P is between .02 and .05, so that sex difference
in the classification by hair colours is probably
significant as judged by this district alone. The
calculation of χ² from "expected" values, though
somewhat more laborious, would have in this case the
advantage of showing in which classes the boys, and
in which classes the girls, were in excess. It appears
from the numbers in the lowest line that the principal
discrepancy is in the "Jet Black" class.
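
The shortened calculation for a 2 × n table can be sketched as follows in Python, using the numbers of Table 17. The variable names are illustrative, and the closed form for P is the standard one for n = 4.

```python
import math

boys = [592, 119, 849, 504, 36]          # Fair, Red, Medium, Dark, Jet Black
girls = [544, 97, 677, 451, 14]
n_boys, n_girls = sum(boys), sum(girls)

# For each pair a, a' with totals n, n': (a n' - a' n)^2 / (a + a').
pair_terms = [
    (a * n_girls - a2 * n_boys) ** 2 / (a + a2)
    for a, a2 in zip(boys, girls)
]
chi2 = sum(pair_terms) / (n_boys * n_girls)

dof = (2 - 1) * (len(boys) - 1)          # four values may be filled in arbitrarily
p = math.exp(-chi2 / 2) * (1 + chi2 / 2) # exact P when n = 4

print([round(t / 1e6, 3) for t in pair_terms], round(chi2, 3), dof, round(p, 4))
```

The printed list reproduces the lowest line of the table in millions, and the last entry shows at once that the "Jet Black" class supplies the greater part of the total.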

Ex. 12. Test of independence in a 4 × 4 classification.
As an example of a more complex contingency
table we may take the results of a series of back-crosses
in mice, involving the two factors Black-Brown,
Self-Piebald (Wachter's data):

TABLE 18

                  Black Self.     Black Piebald.    Brown Self.    Brown Piebald.    Total.
Coupling:
  F1 Males         88  (85.37)      82  (75.24)     75 (70.93)      60  (73.46)       305
  F1 Females       38  (34.43)      34  (30.34)     30 (28.60)      21  (29.63)       123
Repulsion:
  F1 Males        115 (117.00)      93 (103.11)     80 (97.21)     130 (100.68)       418
  F1 Females       96 (100.20)      88  (88.31)     95 (83.26)      79  (86.23)       358

Total             337              297             280             290               1204

The back-crosses were made in four ways, according
as the male or female parents were heterozygous
(F1) in the two factors, and according to whether the
two dominant genes were received both from one
(Coupling) or one from each parent (Repulsion).

The simple Mendelian ratios may be disturbed by
differential viability, by linkage, or by linked lethals.
Linkage is not suspected in these data, and if the
only disturbance were due to differential viability of the
four genotypes, these should always appear in the
same proportion; to test if the data show significant
departures we may apply the χ² test to the whole
4 × 4 table. The values expected on the hypothesis
that the proportions are independent of the matings
used, or that the four series are homogeneous, are
given above in brackets. The contributions to χ²
made by each cell are given in Table 19.

TABLE 19

     .081     .607     .234    2.466     3.388
     .370     .442     .069    2.514     3.395
     .034     .991    3.047    8.539    12.611
     .176     .001    1.655     .606     2.438

     .661    2.041    5.005   14.125    21.832

The value of χ² is therefore 21.832; the value of
n is 9, for we could fill up a block of three rows and
three columns and still adjust the remaining entries to
check with the margins. In general for a contingency
table of r rows and c columns n = (r − 1)(c − 1). For
n = 9, the value of χ² shows that P is less than .01, and
therefore the departures from proportionality are not
fortuitous; it is apparent that the discrepancy is due
to the exceptional number of Brown Piebalds in the F1
males repulsion series.
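
The general rule n = (r − 1)(c − 1) and the cell contributions can be checked with a short Python sketch based on the observed numbers of Table 18. The helper `chi2_sf_odd` is an illustrative name; it evaluates the finite series for odd whole n. Small differences from the printed figures in the third decimal place may arise from rounding.

```python
import math

# Table 18: rows are the four back-cross series, columns the four classes.
observed = [
    [88, 82, 75, 60],      # Coupling, F1 males
    [38, 34, 30, 21],      # Coupling, F1 females
    [115, 93, 80, 130],    # Repulsion, F1 males
    [96, 88, 95, 79],      # Repulsion, F1 females
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Expected values proportional to the margins, and each cell's share of chi-square.
contrib = [
    [(observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
     / (row_totals[i] * col_totals[j] / grand)
     for j in range(len(col_totals))]
    for i in range(len(row_totals))
]
chi2 = sum(map(sum, contrib))
n = (len(row_totals) - 1) * (len(col_totals) - 1)

def chi2_sf_odd(x, n):
    """P for odd n, from the finite series for whole degrees of freedom."""
    term, total = math.sqrt(x), 0.0
    for r in range(1, (n - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total

p = chi2_sf_odd(chi2, n)
largest = max((contrib[i][j], i, j) for i in range(4) for j in range(4))
print(round(chi2, 3), n, round(p, 4), largest)
```

The largest single contribution falls in the third row and fourth column, that is, the Brown Piebalds of the F1 males repulsion series.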

It should be noted that the methods employed in
this chapter are not designed to measure the degree of
association between one classification and another, but
solely to test whether the observed departures from
independence are or are not of a magnitude ascribable
to chance. The same degree of association may be
significant for a large sample but insignificant for a
small one; if it is insignificant we have no reason on
the data present to suspect any degree of association at
all, and it is useless to attempt to measure it. If, on
the other hand, it is significant the value of χ² indicates
the fact, but does not measure the degree of
association. Provided the deviation is clearly significant,
it is of no practical importance whether P is .01
or .000,001, and it is for this reason that we have not
tabulated the value of χ² beyond .01. To measure
the degree of association it is necessary to have some
hypothesis as to the nature of the departure from
independence to be measured. With Mendelian frequencies,
for example, the cross-over percentage may
be used to measure the degree of association of two
factors, and the significance of evidence for linkage
may be tested by comparing the difference between
the cross-over percentage and 50 per cent (the value
for unlinked factors), with its standard error. Such a
comparison, if accurately carried out, must agree
absolutely with the conclusion drawn from the χ²
test. To take a second example, the values in a fourfold
table may be sometimes regarded as due to the
partition of a normally correlated pair of variates,
according as the values are above or below arbitrarily
chosen dividing-lines; as if a group of stature measurements
of fathers and sons were divided between those
above and those below 68 inches. In this case
the departure from independence may be properly
measured by the correlation in stature between father
and son; this quantity can be estimated from the
observed frequencies, and a comparison between the
value obtained and its standard error, if accurately
carried out, will agree with the χ² test as to the significance
of the association; the significance will become
more and more pronounced as the sample is increased
in size, but the correlation obtained will tend to a
fixed value. The χ² test does not attempt to measure
the degree of association, but as a test of significance
it is independent of all additional hypotheses as to the
nature of the association.

Tests of homogeneity are mathematically identical
with tests of independence; the last example may
equally be regarded in either light. In Chapter III.
the tests of agreement with the Binomial series were
essentially tests of homogeneity; the ten samples of
100 ears of barley (Ex. 7, p. 70) might have been represented
as a 2 × 10 table. The χ² index of dispersion
would then be equivalent to the χ² obtained from
the contingency table. The method of this chapter is
more general, and is applicable to cases in which the
successive samples are not all of the same size.

Ex. 13. Homogeneity of different families in
respect of ratio black : red. The following data show
in 33 families of Gammarus (Huxley's data) the
numbers with black and red eyes respectively:

TABLE 20

Black    79  120   24  117   62   79   66   45   61   64  208  154   31  158   21  105   28
Red      14   31    6   29   17   20   12   11   14   13   52   45    4   45    4   28    7
Total    93  151   30  146   79   99   78   56   75   77  260  199   35  203   25  133   35

                                                                                        Total
Black    58   81   25   95   47   67   30   70  139  179  129   44   24   19   45   91   2565
Red      19   27    8   29   16   21   11   28   57   62   44   17    9    8   23   41    772
Total    77  108   33  124   63   88   41   98  196  241  173   61   33   27   68  132   3337

The totals 2565 black and 772 red are distinctly
not in the ratio 3:1, which is ascribed to linkage.
The question before us is whether or not all the families
indicate the same ratio between black and red, or
whether the discrepancy is due to a few families only.
For the whole table χ² = 35.620, n = 32. This is
beyond the range of the table, so we apply the method
explained on p. 61:

$$\sqrt{2\chi^2} = 8.44;$$
$$\sqrt{2n - 1} = 7.94;$$
$$\text{Difference} = +.50 \pm 1.$$

The series is therefore not significantly heterogeneous;
effectively all the families agree and confirm
each other in indicating the black-red ratio observed in
the total.
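
The test of homogeneity for the 33 families may be sketched in Python as below. The family counts are those read from Table 20 (they reproduce the totals 2565 and 772); the variable names are illustrative, and the value of χ² recomputed in this way should lie close to, though not necessarily agree to the last figure with, the value quoted in the text.

```python
import math

black = [79, 120, 24, 117, 62, 79, 66, 45, 61, 64, 208, 154, 31, 158, 21, 105, 28,
         58, 81, 25, 95, 47, 67, 30, 70, 139, 179, 129, 44, 24, 19, 45, 91]
red = [14, 31, 6, 29, 17, 20, 12, 11, 14, 13, 52, 45, 4, 45, 4, 28, 7,
       19, 27, 8, 29, 16, 21, 11, 28, 57, 62, 44, 17, 9, 8, 23, 41]

B, R = sum(black), sum(red)
N = B + R

# Expected numbers in each family follow the black : red ratio of the totals.
chi2 = 0.0
for b, r in zip(black, red):
    t = b + r
    chi2 += (b - t * B / N) ** 2 / (t * B / N) + (r - t * R / N) ** 2 / (t * R / N)

n = len(black) - 1                       # a 2 x 33 table has 32 degrees of freedom

# Beyond the range of the table: refer sqrt(2 chi2) - sqrt(2n - 1) to the normal curve.
z = math.sqrt(2 * chi2) - math.sqrt(2 * n - 1)
print(B, R, n, round(chi2, 2), round(z, 2))
```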

Exactly the same procedure would be adopted if
the black and red numbers represented two samples
distributed according to some character or characters
each into 33 classes. The question "Are these
samples of the same population?" is in effect identical
with the question "Is the proportion of black to red
the same in each family?" To recognise this identity
is important, since it has been very widely disregarded.

Ex. 14. Agreement with expectation of normal
variance. Closely akin to tests of homogeneity is the
use of the χ² distribution to test whether or not an
observed series of values, normally or nearly normally
distributed, agrees in its variance with expectation.
If x₁, x₂, ..., are a sample of a normal population,
the standard deviation of which population is σ, then

$$\frac{1}{\sigma^2} S(x - \bar{x})^2$$

is distributed in random samples as is χ², taking n
one less than the number of the sample. J. W. Bispham
gives three series of experimental values of the partial
correlation coefficient, which he assumes should be
distributed so that 1/σ² = 29, but which theoretically
should have 1/σ² = 28. The values of S(x − x̄)² for the
three samples of 1000, 200, 100 respectively are, as
judged from the grouped data,

35.0278,     7.5071,     3.6169,

whence the values of χ² on the two theories are

TABLE 21

                    Exp. 1.       2.         3.       Total.     √(2χ²).    Difference.
29 S(x − x̄)²       1015.81     217.71     104.89    1338.41      51.74        +.82
28 S(x − x̄)²        980.78     210.20     101.27    1292.25      50.82        −.10
Expectation (n)      999         199         99       1297       50.92

It will be seen that the true formula for the variance
gives slightly the better agreement. That the difference
is not significant may be seen from the last two
columns. About 6000 observations would be needed
to discriminate experimentally, with any certainty,
between the two formulae.
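
The entries of Table 21 follow directly from the three sums of squares. The Python sketch below takes the second sum as 7.5071, the value consistent with the tabulated products; the function name `deviate` is illustrative. Recomputed values may differ from the printed ones by a unit or two in the second decimal place.

```python
import math

sums_sq = [35.0278, 7.5071, 3.6169]      # S(x - mean)^2 for the three samples
sizes = [1000, 200, 100]
n_total = sum(s - 1 for s in sizes)      # 999 + 199 + 99 degrees of freedom

def deviate(inv_var):
    """sqrt(2 chi2) - sqrt(2n - 1) for the pooled chi-square, given 1/sigma^2."""
    chi2_total = inv_var * sum(sums_sq)
    return math.sqrt(2 * chi2_total) - math.sqrt(2 * n_total - 1)

for inv_var in (29, 28):
    per_sample = [round(inv_var * s, 2) for s in sums_sq]
    print(inv_var, per_sample, round(inv_var * sum(sums_sq), 2), round(deviate(inv_var), 2))
```

Both deviates are well within the unit standard deviation, in agreement with the conclusion that neither formula is contradicted by these samples.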


22. Partition of χ² into its Components

Just as values of χ² may be aggregated together to
make a more comprehensive test, so in some cases it is
possible to separate the contributions to χ² made by
the individual degrees of freedom, and so to test the
separate components of a discrepancy.

Ex. 15. Partition of observed discrepancies from
Mendelian expectation. The following table (de
Winton and Bateson's data) gives the distribution of
sixteen families of primula in the eight classes obtained
from a back-cross with the triple recessive:


TABLE 22

[Table of the class frequencies in each of the sixteen families, with their totals and, in the lowest line, the value of χ² for each family. The entries are not recoverable here.]


GOODNESS  OF  FIT,  INDEPENDENCE,  ETC.     93 

The theoretical expectation is that the eight classes
should appear in equal numbers, corresponding to the
hypothesis that in each factor the allelomorphs occur
with equal frequency, and that the three factors are
unlinked. This expectation is fairly realised in the
totals of the sixteen families, but the individual
families are somewhat irregular. The values of χ²
obtained by comparing each family with expectation
are given in the lowest line. These values each
correspond to seven degrees of freedom, and it appears
that in 5 cases out of 16, P is less than .1, and of these
2 are less than .02. This confirms the impression of
irregularity, and the total value of χ² (not to be confused
with χ² derived from the totals), which corresponds
to 112 degrees of freedom, is 151.78.

$$\text{Now } \sqrt{2n - 1} = \sqrt{223} = 14.93;$$
$$\sqrt{2\chi^2} = \sqrt{303.56} = 17.42;$$
$$\text{Difference} = +2.49;$$

so that, judged by the total χ², the evidence for
departures from expectation in individual families is
clear.

Each family is free to differ from expectation
in seven independent ways. To carry the analysis
further, we must separate the contribution to χ² of
each of these seven degrees of freedom. Mathematically
the subdivision may be carried out in more
than one way, but the only way which appears to be of
biological interest is that which separates the parts due
to inequality of the allelomorphs of the three factors, and
the three possible linkage connexions. If we separate
the frequencies into positive and negative values
according to the following seven ways,

TABLE 23

             Ch.    G.    W.    G W.    Ch W.    Ch G.    Ch G W.
Ch G W        +     +     +      +        +        +         +
Ch G w        +     +     -      -        -        +         -
Ch g W        +     -     +      -        +        -         -
Ch g w        +     -     -      +        -        -         +
ch G W        -     +     +      +        -        -         -
ch G w        -     +     -      -        +        -         +
ch g W        -     -     +      -        -        +         +
ch g w        -     -     -      +        +        +         -


then it will be seen that all seven subdivisions are
wholly independent, since any two of them agree in
four signs and disagree in four. The first three
degrees of freedom represent the inequalities in the
allelomorphs of the three factors Ch, G, and W; the
next are the degrees of freedom involved in an enquiry
into the linkage of the three pairs of factors, while the
seventh degree of freedom has no simple biological
meaning but is necessary to complete the analysis.
If we take in the first family, for example, the difference
between the numbers of the W and w plants,
namely 8, then the contribution of this degree of
freedom to χ² is found by squaring the difference
and dividing by the number in the family, e.g.
8² ÷ 72 = .889. In this way the contribution of
each of the 112 degrees of freedom in the sixteen
families is found separately, as shown in the following
table:

TABLE 24

Family.     Ch.       G.        W.       G W.     Ch W.     Ch G.    Ch G W.    Total.
  54       3.556     2.000      .889     .222     2.000      .889      .222      9.778
  55        .076     3.034      .076    3.034      .412     1.017      .210      7.859
  58        .820      .820      .820     .295     1.607      .820      .295      5.477
  59        .153      .831     4.898     .017     6.119      .831      .153     13.002
 107       6.720      .269     3.108    1.817      .097      .269      .269     12.549
 110      14.821     1.282      .821     .821      .205     1.282     0         19.232
 119       6.261      .391      .391     .174     2.130      .043      .696     10.086
 121      11.000     0         0         .364      .818      .091      .091     12.364
 122        .161     6.200     1.090    1.865      .523      .316     7.903     18.058
 127        .610      .024      .220     .610     1.195      .220     1.976      4.855
 129        .900     1.600     0         .400      .100      .900      .900      4.800
 131        .172      .062      .062     .062      .062      .338     8.448      9.206
 132        .163      .791      .320     .320      .059     1.471      .059      3.183
 133        .220      .220     4.122     .024     8.805      .220      .610     14.221
 135        .211     3.368     1.316     .053      .053     0          .053      5.054
 178        .258      .835      .093     .093      .010      .258      .505      2.052

Total     46.102    21.727    18.226   10.171    24.195     8.965    22.390    151.776
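
The partition into seven independent components, and the reference of the column totals of Table 24 to the χ² distribution with n = 16, can be sketched in Python as follows. The counts in `family` are hypothetical and serve only to illustrate the identity between the seven components and the χ² of a family; the helper names are likewise assumptions of this sketch.

```python
import math
from itertools import product

# Signs of (Ch, G, W) for the eight classes, in the row order of Table 23.
classes = list(product((1, -1), repeat=3))
contrasts = {
    "Ch": [c for c, g, w in classes],
    "G": [g for c, g, w in classes],
    "W": [w for c, g, w in classes],
    "GW": [g * w for c, g, w in classes],
    "ChW": [c * w for c, g, w in classes],
    "ChG": [c * g for c, g, w in classes],
    "ChGW": [c * g * w for c, g, w in classes],
}

# Any two columns agree in four signs and disagree in four: zero inner product.
names = list(contrasts)
orthogonal = all(
    sum(s * t for s, t in zip(contrasts[a], contrasts[b])) == 0
    for i, a in enumerate(names) for b in names[i + 1:]
)

def components(counts):
    """Contribution of each degree of freedom: (signed sum)^2 / family size."""
    size = sum(counts)
    return {k: sum(s * x for s, x in zip(v, counts)) ** 2 / size
            for k, v in contrasts.items()}

family = [12, 7, 9, 11, 8, 10, 6, 9]     # hypothetical counts, not from the text
parts = components(family)
expected = sum(family) / 8               # equal expectation in the eight classes
chi2_family = sum((x - expected) ** 2 / expected for x in family)

def chi2_sf_even(x, n):
    """Exact P for even n: exp(-x/2) * sum_{k < n/2} (x/2)^k / k!."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, n // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Column totals of Table 24, each based on sixteen families (n = 16).
column_totals = {"Ch": 46.102, "G": 21.727, "W": 18.226, "GW": 10.171,
                 "ChW": 24.195, "ChG": 8.965, "ChGW": 22.390}
p_values = {k: chi2_sf_even(v, 16) for k, v in column_totals.items()}

print(orthogonal, round(sum(parts.values()), 3), round(chi2_family, 3))
print({k: round(p, 4) for k, p in p_values.items()})
```

The seven components of any family add up to its χ² on the hypothesis of equal expectation, which is what makes the separate columns of the table additive.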

Looking at the total values of χ² for each column,
since n is 16 for these, we see that all except the first
have values of P between .05 and .95, while the contribution
of the first degree of freedom is very clearly
significant. It appears then that the greater part,
if not the whole, of the discrepancy is ascribable to
the behaviour of the Sinensis-Stellata factor, and its
behaviour strongly suggests close linkage with a
recessive lethal gene of one of the familiar types. In
four families, 107-121, the only high contribution is in
the first column. If these four families are excluded
χ² = 97.545, and this exceeds the expectation for
n = 84 by only just over the standard error; the total
discrepancy cannot therefore be regarded as significant.
There does, however, appear to be an excess of very
large entries, and it is noticeable of the seven largest,

TABLE III

  n.    P = .99.      .98.       .95.       .90.       .80.       .70.
   1     .000157    .000628    .00393     .0158      .0642      .148
   2     .0201      .0404      .103       .211       .446       .713
   3     .115       .185       .352       .584      1.005      1.424
   4     .297       .429       .711      1.064      1.649      2.195
   5     .554       .752      1.145      1.610      2.343      3.000
   6     .872      1.134      1.635      2.204      3.070      3.828
   7    1.239      1.564      2.167      2.833      3.822      4.671
   8    1.646      2.032      2.733      3.490      4.594      5.527
   9    2.088      2.532      3.325      4.168      5.380      6.393
  10    2.558      3.059      3.940      4.865      6.179      7.267
  11    3.053      3.609      4.575      5.578      6.989      8.148
  12    3.571      4.178      5.226      6.304      7.807      9.034
  13    4.107      4.765      5.892      7.042      8.634      9.926
  14    4.660      5.368      6.571      7.790      9.467     10.821
  15    5.229      5.985      7.261      8.547     10.307     11.721
  16    5.812      6.614      7.962      9.312     11.152     12.624
  17    6.408      7.255      8.672     10.085     12.002     13.531
  18    7.015      7.906      9.390     10.865     12.857     14.440
  19    7.633      8.567     10.117     11.651     13.716     15.352
  20    8.260      9.237     10.851     12.443     14.578     16.266
  21    8.897      9.915     11.591     13.240     15.445     17.182
  22    9.542     10.600     12.338     14.041     16.314     18.101

23 

10-196 

11-293 

13-091 

14-848 

17-187 

19-021 

24 

10-856 

11-992 

13*848 

15-659 

18-062 

19*943 

25 

11-524 

12-697 

14-611 

16-473 

18-940 

20-867 

26 

12-198 

13-409 

15*379 

17-292 

19-820 

21-792 

27 

12-879 

14-125 

16-151 

18-114 

20-703 

22-719 

28 

I3*565 

14-847 

16-928 

18.939 

21-588 

23*647 

29 

14-256 

15-574 

17-708 

19.768 

22-475 

24*577 

30 

H-953 

16-306 

18-493 

20-599 

23.364 

25-508 

For  larger  values  of  n,  the  expression  -\/2X2  ~  V^w  -  1 
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Table  of  x2 


•50. 

•30. 

•20. 

•10. 

•05. 

•02. 

•01. 

•455 

1-074 

1-642 

2-706 

3-841 

5-412 

6-635 

1-386 

2-408 

3-219 

4-605 

5-991 

7-824 

9-210 

2-366 

3-665 

4-642 

6-251 

7-8i5 

9-837 

11-341 

3*357 

4-878 

5-989 

7.779 

9-488 

n-668 

13.277 

4-351 

6-064 

7-289 

9-236 

11-070 

13-388 

15-086 

5-348 

7-231 

8-558 

10-645 

12-592 

15*033 

16-812 

6-346 

8-383 

9-803 

12-017 

14-067 

16-622 

18-475 

7*344 

9-524 

11-030 

13-362 

I5-507 

18-168 

20-090 

8-343 

10-656 

12-242 

14-684 

16-919 

19-679 

21-666 

9*342 

11-781 

13-442 

15-987 

18-307 

21-161 

23-209 

10-341 

12-899 

14-631 

17-275 

19-675 

22-618 

24*725 

11-340 

14-011 

15-812 

18-549 

21-026 

24-054 

26-217 

12-340 

15-119 

16-985 

19-812 

22-362 

25-472 

27-688 

13*339 

l6-222 

18-151 

21-064 

23-685 

26-873 

29-141 

14*339 

I7-322 

19-311 

22-307 

24-996 

28-259 

30-578 

I5-338 

I8-4I8 

20-465 

23-542 

26-296 

29-633 

32-000 

16-338 

I9-5II 

21-615 

24-769 

27-587 

30-995 

33-409 

17-338 

20-601 

22-760 

25-989 

28-869 

32-346 

34*805 

18-338 

21-689 

23-900 

27-204 

30-144 

33-687 

36-191 

19*337 

22-775 

25-038 

28-412 

31-410 

35-020 

37-566 

20-337 

23-858 

26-171 

29-615 

32-671 

36-343 

38-932 

21-337 

24-939 

27-301 

30-813 

33-924 

37-659 

40-289 

22-337 

26-018 

28-429 

32-007 

35-172 

38-968 

41-638 

23*337 

27-096 

29*553 

33-196 

36-415 

40-270 

42-980 

24*337 

28-172 

30-675 

34-382 

37-652 

41-566 

44-3I4 

25-336 

29-246 

31*795 

35-563 

38-885 

42-856 

45-642 

26-336 

30-319 

32-912 

36-741 

40-113 

44-140 

46-963 

27-336 

3I-39I 

34-027 

37-916 

41*337 

45-419 

48-278 

28-336 

32-461 

35-139 

39*087 

42-557 

46-693 

49-588 

29-336 

33-530 

36-250 

40-256 

43*773 

47-962 

-» 

50-892 

may  be  used  as  a  normal  deviate  with  unit  standard  error. 
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six  appear  in  pairs  belonging  to  the  same  family.  The 
distribution  of  the  remaining  12  families  according  to 
the  value  of  P  is  as  follows  : 

TABLE 25

P          1.0   0.9   0.8   0.7   0.5   0.3   0.2   0.1   0.05  0.02  0.01   0     Total
Families       1     1     0     4     1     2     0     1      1     1     0         12

from which it would appear that there is some slight evidence of an excess of families with high values of $\chi^2$. This effect, like other non-significant effects, is only worth further discussion in connexion with some plausible hypothesis capable of explaining it.
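The comparison of $\chi^2 = 97.545$ with its expectation for n = 84 can be checked with the large-n rule given with the table of $\chi^2$, namely that $\sqrt{2\chi^2} - \sqrt{2n - 1}$ behaves as a normal deviate with unit standard error. The following is a minimal sketch of that arithmetic, assuming Python with only the standard library; the function names are illustrative and not part of the original text.

```python
import math

def chi2_normal_deviate(chi2: float, n: int) -> float:
    """Large-n approximation: sqrt(2*chi2) - sqrt(2n - 1) as a unit normal deviate."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)

def upper_tail(z: float) -> float:
    """One-sided normal probability of exceeding z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Family totals of chi-square from Table 24 (7 degrees of freedom per family).
family_totals = {
    54: 9.778, 55: 7.859, 58: 5.477, 59: 13.002,
    107: 12.549, 110: 19.232, 119: 10.086, 121: 12.364,
    122: 18.058, 127: 4.855, 129: 4.800, 131: 9.206,
    132: 3.183, 133: 14.221, 135: 5.054, 178: 2.052,
}
excluded = {107, 110, 119, 121}  # families whose only high entry is in the first column

remaining_chi2 = sum(v for fam, v in family_totals.items() if fam not in excluded)
n = 7 * (len(family_totals) - len(excluded))

deviate = chi2_normal_deviate(remaining_chi2, n)
p_value = upper_tail(deviate)
print(f"chi2 = {remaining_chi2:.3f}, n = {n}, deviate = {deviate:.3f}, P = {p_value:.3f}")
```

The deviate comes out a little above 1, which agrees with the statement that the total exceeds expectation by only just over the standard error.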

The general procedure to follow in analysing $\chi^2$ into its components will be developed in Section 55.


N.B. Table of $\chi^2$, p. 96.


TESTS OF SIGNIFICANCE OF MEANS, DIFFERENCES OF MEANS, AND REGRESSION COEFFICIENTS

23.  The  Standard  Error  of  the  Mean 

The fundamental proposition upon which the statistical treatment of mean values is based is that: If a quantity be normally distributed with standard deviation $\sigma$, then the mean of a random sample of n such quantities is normally distributed with standard deviation $\sigma/\sqrt{n}$.

The utility of this proposition is somewhat increased by the fact that even if the original distribution were not exactly normal, that of the mean usually tends to normality, as the size of the sample is increased; the method is therefore applied widely and legitimately to cases in which we have not sufficient evidence to assert that the original distribution was normal, but in which we have reason to think that it does not belong to the exceptional class of distributions for which the distribution of the mean does not tend to normality.

If, therefore, we know the standard deviation of a population, we can calculate the standard deviation of the mean of a random sample of any size, and so test whether or not it differs significantly from any fixed value. If the difference is many times greater than the standard error, it is certainly significant, and it is a convenient convention to take twice the standard error as the limit of significance; this is roughly equivalent to the corresponding limit P = 0.05, already used for the $\chi^2$ distribution. The deviations in the normal distribution corresponding to a number of values of P are given in the lowest line of the table of t at the end of this chapter (p. 139). More detailed information has been given in Table I.

Ex. 16. Significance of mean of a large sample. We may consider from this point of view Weldon's die-casting experiment (Ex. 5, p. 64). The variable quantity is the number of dice scoring "5" or "6" in a throw of 12 dice. In the experiment this number varies from zero to eleven, with an observed mean of 4.0524; the expected mean, on the hypothesis that the dice were true, is 4, so that the deviation observed is 0.0524. If now we estimate the variance of the whole sample of 26,306 values as explained in Ex. 2, but without using Sheppard's correction (for the data are not grouped, and even with grouped data, since the mean is affected by grouping errors, its variance should be estimated without this correction), we find

$\sigma^2 = 2.69825$,

whence $\sigma^2/n = 0.0001025$,

and $\sigma/\sqrt{n} = 0.01013$.

The standard error of the mean is therefore about 0.01, and the observed deviation is nearly 5.2 times as great; thus by a slightly different path we arrive at the same conclusion as that of p. 66. The difference between the two methods is that our treatment of the mean does not depend upon the hypothesis that the distribution is of the binomial form, but on the other hand we do assume the correctness of the value of $\sigma$ derived from the observations. This assumption breaks down for small samples, and the principal purpose of this chapter is to show how accurate allowance can be made in these tests of significance for the errors in our estimates of the standard deviation.
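The arithmetic of Ex. 16 may be retraced as follows. This is a minimal sketch, assuming Python with only the standard library; the variable names are illustrative, and the figures are those quoted in the example.

```python
import math

variance = 2.69825       # estimated variance of the number of successes per throw
n_throws = 26306         # number of throws of 12 dice
observed_mean = 4.0524
expected_mean = 4.0      # expectation if the dice were true

variance_of_mean = variance / n_throws
standard_error = math.sqrt(variance_of_mean)
deviation_in_se = (observed_mean - expected_mean) / standard_error

print(f"variance of mean = {variance_of_mean:.7f}")
print(f"standard error   = {standard_error:.5f}")
print(f"deviation / s.e. = {deviation_in_se:.2f}")
```

The ratio of the deviation to its standard error is a little under 5.2, far beyond the conventional limit of twice the standard error.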

To return to the cruder theory, we may often, as in the above example, wish to compare the observed mean with the value appropriate to a hypothesis which we wish to test; but equally or more often we wish to compare two experimental values and to test their agreement. In such cases we require the standard error of the difference between two quantities whose standard errors are known; to find this we make use of the proposition that the variance of the difference of two independent variates is equal to the sum of their variances. Thus, if the standard deviations are $\sigma_1$, $\sigma_2$, the variances are $\sigma_1^2$ and $\sigma_2^2$; consequently the variance of the difference is $\sigma_1^2 + \sigma_2^2$, and the standard error of the difference is $\sqrt{\sigma_1^2 + \sigma_2^2}$.

Ex. 17. Standard error of difference of means from large samples. In Table 2 is given the distribution in stature of a group of men, and also of a group of women; the means are 68.64 and 63.87 inches, giving a difference of 4.77 inches. The variance obtained for the men was 7.2964 square inches; this is the value obtained by dividing the sum of the squares of the deviations by 1164; if we had divided by 1163, to make the method comparable to that appropriate to small samples, we should have found 7.3027. Dividing this by 1164, we find the variance of the mean is 0.006274. Similarly the variance for the women is 6.6991, which divided by 1456 gives the variance of the mean of the women as 0.004601. To find the variance of the difference between the means, we must add together these two contributions, and find in all 0.010875; the standard error of the difference between the means is therefore 0.1043 inches. The sex difference in stature may therefore be expressed as

4.77 ± 0.104 inches.
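The rule that the variances of independent means are added may be sketched as follows, assuming Python with only the standard library. The function name is illustrative; the figures are those of Ex. 17.

```python
import math

def se_of_difference(var_1: float, n_1: int, var_2: float, n_2: int) -> float:
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(var_1 / n_1 + var_2 / n_2)

mean_men, var_men, n_men = 68.64, 7.3027, 1164
mean_women, var_women, n_women = 63.87, 6.6991, 1456

difference = mean_men - mean_women
standard_error = se_of_difference(var_men, n_men, var_women, n_women)
ratio = difference / standard_error

print(f"difference = {difference:.2f} ± {standard_error:.3f} inches ({ratio:.1f} standard errors)")
```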

It is manifest that this difference is significant, the value found being over 45 times its standard error. In this case we can not only assert a significant difference, but place its value with some confidence at between 4½ and 5 inches. It should be noted that we have treated the two samples as independent, as though they had been given by different authorities; as a matter of fact, in many cases brothers and sisters appeared in the two groups; since brothers and sisters tend to be alike in stature, we have overestimated the probable error of our estimate of the sex difference. Whenever possible, advantage should be taken of such facts in designing experiments. In the common phrase, sisters provide a better "control" for their brothers than do unrelated women. The sex difference could therefore be more accurately estimated from the comparison of each brother with his own sister. In the following example (Pearson and Lee's data), taken from a correlation table of stature of brothers and sisters, the material is nearly of this form; it differs from it in that in some instances the same individual has been compared with more than one sister, or brother.

Ex. 18. Standard error of mean of differences. The following table gives the distribution of the excess in stature of a brother over his sister in 1401 pairs.

TABLE 26

Stature difference in inches     -5      -4      -3      -2      -1       0       1        2        3        4        5
Frequency                      0.25     1.5    1.25     4.5   11.25   27.25   71.75   122.75   17?.75   209.75    220.5

Stature difference in inches      6        7       8     9    10      11    12     13    14    15     16    Total
Frequency                     205.5   14?.75   95.75    57    26   11.25   8.5   2.75     1     1   0.75     1401

Treating this distribution as before we obtain: mean = 4.895, variance = 6.4074, variance of mean = 0.004573, standard error of mean = 0.0676; showing that we may estimate the mean sex difference as 4¾ to 5 inches.

In the above examples, which are typical of the use of the standard error applied to mean values, we have assumed that the variance of the population is known with exactitude. It was pointed out by "Student" in 1908, that with small samples, such as are of necessity usual in field and laboratory experiments, the variance of the population can only be roughly estimated from the sample, and that the errors of estimation seriously affect the use of the standard error.

If x (for example the mean of a sample) is a value with normal distribution and $\sigma$ is its true standard error, then the probability that $x/\sigma$ exceeds any specified value may be obtained from the appropriate table of the normal distribution; but if we do not know $\sigma$, but in its place have s, an estimate of the value of $\sigma$, the distribution required will be that of $x/s$, and this is not normal. The true value has been divided by a factor, $s/\sigma$, which introduces an error. We have seen in the last chapter that the distribution in random samples of $s^2/\sigma^2$ is that of $\chi^2/n$, when n is equal to the number of degrees of freedom of the group (or groups) of which $s^2$ is the mean square deviation. Consequently the distribution of $s/\sigma$ is calculable, and if its variation is completely independent of that of $x/\sigma$ (as in the cases to which this method is applicable), then the true distribution of $x/s$ can be calculated, and accurate allowance made for its departure from normality. The only modification required in these cases depends solely on the number n, representing the number of degrees of freedom available for the estimation of $\sigma$. The necessary distributions were given by "Student" in 1908; fuller tables have since been given by the same author, and at the end of this chapter (p. 139) we give the distributions in a similar form to that used for our table of $\chi^2$.
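The departure of $x/s$ from normality may be illustrated by random sampling. The following is a rough simulation sketch, assuming Python with only the standard library; the sample sizes, the limit of twice the standard error, the number of trials and the seed are arbitrary choices made for illustration and are not taken from the text.

```python
import math
import random

def t_statistic(sample):
    """Mean divided by its estimated standard error (sum of squares over n(n - 1))."""
    n_obs = len(sample)
    mean = sum(sample) / n_obs
    sum_sq = sum((v - mean) ** 2 for v in sample)
    return mean / math.sqrt(sum_sq / (n_obs * (n_obs - 1)))

def fraction_beyond(sample_size, limit=2.0, trials=10000, seed=1):
    """Fraction of normal samples (true mean zero) whose |t| exceeds the limit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if abs(t_statistic(sample)) > limit:
            hits += 1
    return hits / trials

# Probability that a normal deviate falls outside twice its true standard error.
normal_tail = math.erfc(2.0 / math.sqrt(2.0))

small_sample_fraction = fraction_beyond(3)    # n = 2 degrees of freedom
large_sample_fraction = fraction_beyond(50)   # n = 49 degrees of freedom

print(f"normal theory:    {normal_tail:.3f}")
print(f"samples of three: {small_sample_fraction:.3f}")
print(f"samples of fifty: {large_sample_fraction:.3f}")
```

With samples of three the limit of twice the estimated standard error is passed several times as often as normal theory allows, while with samples of fifty the two nearly agree.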

24.  The  Significance  of  the  Mean  of  a  Unique  Sample 

If $x_1, x_2, \ldots, x_{n'}$ is a sample of $n'$ values of a variate, x, and if this sample constitutes the whole of the information available on the point in question, then we may test whether the mean of x differs significantly from zero by calculating the statistics

$$\bar{x} = \frac{1}{n'} S(x),$$

$$\frac{s^2}{n'} = \frac{1}{n'(n' - 1)} S(x - \bar{x})^2,$$

$$t = \frac{\bar{x}\sqrt{n'}}{s},$$

$$n = n' - 1.$$

The distribution of t for random samples of a normal population distributed about zero as mean is given in the table of t for each value of n. The successive columns show, for each value of n, the values of t for which P, the probability of falling outside the range $\pm t$, takes the values 0.9, ..., 0.01, at the head of the columns. Thus the last column shows that, when n = 10, just 1 per cent of such random samples will give values of t exceeding +3.169, or less than -3.169. If it is proposed to consider the chance of exceeding the given values of t, in a positive (or negative) direction only, then the values of P should be halved. It will be seen from the table that for any degree of certainty we require higher values of t, the smaller the value of n. The bottom line of the table, corresponding to infinite values of n, gives the values of a normally distributed variate, in terms of its standard deviation, for the same values of P.

Ex. 19. Significance of mean of a small sample. The following figures (Cushny and Peebles' data), which I quote from Student's paper, show the result of an experiment with ten patients, on the effect of the optical isomers of hyoscyamine hydrobromide in producing sleep.

TABLE 27

Additional Hours of Sleep gained by the Use of Hyoscyamine Hydrobromide

Patient     1 (Dextro-)   2 (Laevo-)   Difference (2 - 1)
1              +0.7          +1.9            +1.2
2              -1.6          +0.8            +2.4
3              -0.2          +1.1            +1.3
4              -1.2          +0.1            +1.3
5              -0.1          -0.1             0.0
6              +3.4          +4.4            +1.0
7              +3.7          +5.5            +1.8
8              +0.8          +1.6            +0.8
9               0.0          +4.6            +4.6
10             +2.0          +3.4            +1.4
Mean           +0.75         +2.33           +1.58

The last column gives a controlled comparison of the efficacy of the two drugs as soporifics, for the same patients were used to test each; from the series of differences we find

$\bar{x} = +1.58$,

$s^2/10 = 0.1513$,

$s/\sqrt{10} = 0.3890$,

$t = 4.06$.

For n = 9, only one value in a hundred will exceed 3.250 by chance, so that the difference between the results is clearly significant. By the methods of the previous chapters we should, in this case, have been led to the same conclusion with almost equal certainty; for if the two drugs had been equally effective, positive and negative signs would occur in the last column with equal frequency. Of the 9 values other than zero, however, all are positive, and it appears from the binomial distribution,

$(\tfrac{1}{2} + \tfrac{1}{2})^9$,

that all will be of the same sign, by chance, only twice in 512 trials. The method of the present chapter differs from that in taking account of the actual values and not merely of their signs, and is consequently the more reliable method when the actual values are available.
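The calculation of Ex. 19, together with the comparison by signs alone, may be sketched as follows, assuming Python with only the standard library. The function name is illustrative; the differences are those of the last column of Table 27.

```python
import math

# Differences (laevo minus dextro) in additional hours of sleep, Table 27.
differences = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]

def one_sample_t(values):
    """Return the mean, the estimated variance of the mean, t, and degrees of freedom."""
    n_obs = len(values)
    mean = sum(values) / n_obs
    sum_sq = sum((v - mean) ** 2 for v in values)
    variance_of_mean = sum_sq / (n_obs * (n_obs - 1))
    t = mean / math.sqrt(variance_of_mean)
    return mean, variance_of_mean, t, n_obs - 1

mean, variance_of_mean, t, dof = one_sample_t(differences)

# Signs only: chance that all non-zero differences share one sign, either way.
nonzero = [d for d in differences if d != 0.0]
sign_probability = 2 * 0.5 ** len(nonzero)

print(f"mean = {mean:.2f}, s^2/n' = {variance_of_mean:.4f}, t = {t:.2f}, n = {dof}")
print(f"all {len(nonzero)} signs alike by chance: {sign_probability:.5f}")
```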

24.1. Comparison of Two Means

In experimental work it is even more frequently necessary to test whether two samples differ significantly in their means, or whether they may be regarded as belonging to the same population. In the latter case any difference in treatment which they may have received will have shown no significant effect.

If $x_1, x_2, \ldots, x_{n_1+1}$ and $x'_1, x'_2, \ldots, x'_{n_2+1}$ be two samples, the significance of the difference between their means may be tested by calculating the following statistics:

$$\bar{x} = \frac{1}{n_1 + 1} S(x), \qquad \bar{x}' = \frac{1}{n_2 + 1} S(x'),$$

$$s^2\left(\frac{1}{n_1 + 1} + \frac{1}{n_2 + 1}\right) = \frac{n_1 + n_2 + 2}{(n_1 + 1)(n_2 + 1)(n_1 + n_2)} \left\{S(x - \bar{x})^2 + S(x' - \bar{x}')^2\right\},$$

$$t = \frac{\bar{x} - \bar{x}'}{s} \sqrt{\frac{(n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2}},$$

$$n = n_1 + n_2.$$

The means are calculated as usual; the standard deviation is estimated by pooling the sums of squares from the two samples and dividing by the total number of the degrees of freedom contributed by them; if $\sigma$ were the true standard deviation, the variance of the first mean would be $\sigma^2/(n_1 + 1)$, of the second mean $\sigma^2/(n_2 + 1)$, and therefore of the difference $\sigma^2\{1/(n_1 + 1) + 1/(n_2 + 1)\}$; t is therefore found by dividing $\bar{x} - \bar{x}'$ by its standard error as estimated, and the error of the estimation is allowed for by entering the table with n equal to the number of degrees of freedom available for estimating s; that is $n = n_1 + n_2$. It is thus possible to extend Student's treatment of the error of a mean to the comparison of the means of two samples.

It may be noted in connexion with this method, and with later developments, which also involve a pooled estimate of the variance, that a difference in variance between the populations from which the samples are drawn will tend somewhat to enhance the value of t obtained. The test, therefore, is decisive, if the value of t is significant, in showing that the samples could not have been drawn from the same population; but it might conceivably be claimed that the difference indicated lay in the variances and not in the means. The theoretical possibility, that a significant value of t should be produced by a difference between the variances only, seems to be unimportant in the application of the method to experimental data; as a supplementary test, however, the significance of the difference between the variances may be tested directly by the method of § 41.

Ex. 20. Significance of difference of means of small samples. Let us suppose that the above figures (Table 27) had been obtained using different patients for the two drugs; the experiment would have been less well controlled, and we should expect to obtain less certain results from the same number of observations, for it is a priori probable, and the above figures suggest, that personal variations in response to the drugs will be to some extent correlated.

Taking, then, the figures to represent two different sets of patients, we have

$\bar{x} - \bar{x}' = +1.58$,

$s^2(\tfrac{1}{10} + \tfrac{1}{10}) = 0.7213$,

$t = +1.861$,

$n = 18$.

The value of P is, therefore, between 0.1 and 0.05, and cannot be regarded as significant. This example shows clearly the value of design in small scale experiments, and that the efficacy of such design is capable of statistical measurement.
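The pooled comparison of Ex. 20 may be sketched as follows, assuming Python with only the standard library. The function name is illustrative; the two series are the dextro- and laevo- columns of Table 27, here treated as if they came from different patients.

```python
import math

dextro = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
laevo = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

def pooled_t(first, second):
    """t for the difference of two means, using a pooled estimate of variance."""
    mean_1 = sum(first) / len(first)
    mean_2 = sum(second) / len(second)
    sum_sq = (sum((v - mean_1) ** 2 for v in first)
              + sum((v - mean_2) ** 2 for v in second))
    dof = len(first) + len(second) - 2       # n = n1 + n2 in the notation of the text
    pooled_variance = sum_sq / dof
    variance_of_difference = pooled_variance * (1 / len(first) + 1 / len(second))
    t = (mean_2 - mean_1) / math.sqrt(variance_of_difference)
    return mean_2 - mean_1, t, dof

difference, t, dof = pooled_t(dextro, laevo)
print(f"difference = {difference:+.2f}, t = {t:+.3f}, n = {dof}")
```

The same observations that gave t above 4 when taken as differences within patients give a t below 2 when the pairing is ignored.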

The use of Student's distribution enables us to appreciate the value of observing a sufficient number of parallel cases; their value lies, not only in the fact that the probable error of a mean decreases inversely as the square root of the number of parallels, but in the fact that the accuracy of our estimate of the probable error increases simultaneously. The need for duplicate experiments is sufficiently widely realised; it is not so widely understood that in some cases, when it is desired to place a high degree of confidence (say P = 0.01) on the results, triplicate experiments will enable us to detect with confidence differences as small as one-seventh of those which, with a duplicate experiment, would justify the same degree of confidence.

The confidence to be placed in a result depends not only on the magnitude of the mean value obtained, but equally on the agreement between parallel experiments. Thus, if in an agricultural experiment a first trial shows an apparent advantage of 8 tons to the acre, and a duplicate experiment shows an advantage of 9 tons, we have n = 1, t = 17, and the results would justify some confidence that a real effect had been observed; but if the second experiment had shown an apparent advantage of 18 tons, although the mean is now higher, we should place not more but less confidence in the conclusion that the treatment was beneficial, for t has fallen to 2.6, a value which for n = 1 is often exceeded by chance. The apparent paradox may be explained by pointing out that the difference of 10 tons between the experiments indicates the existence of uncontrolled circumstances so influential that in both cases the apparent benefit may be due to chance, whereas in the former case the relatively close agreement of the results suggests that the uncontrolled factors are not so very influential. Much of the advantage of further replication lies in the fact that with duplicates our estimate of the importance of the uncontrolled factors is so extremely hazardous.
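The two values of t quoted for the duplicate experiments may be verified as follows. This is a minimal sketch, assuming Python with only the standard library; the function name is illustrative.

```python
import math

def t_from_replicates(values):
    """t for the mean of replicate results against zero, with its degrees of freedom."""
    n_obs = len(values)
    mean = sum(values) / n_obs
    sum_sq = sum((v - mean) ** 2 for v in values)
    standard_error = math.sqrt(sum_sq / (n_obs * (n_obs - 1)))
    return mean / standard_error, n_obs - 1

t_close, dof = t_from_replicates([8.0, 9.0])    # advantages of 8 and 9 tons
t_wide, _ = t_from_replicates([8.0, 18.0])      # advantages of 8 and 18 tons

print(f"8 and 9 tons:  t = {t_close:.1f}, n = {dof}")
print(f"8 and 18 tons: t = {t_wide:.1f}, n = {dof}")
```

The higher mean of the second pair is more than offset by the tenfold increase in its estimated standard error.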

In cases in which each observation of one series corresponds in some respects to a particular observation of the second series, it is always legitimate to take the differences and test them as in Ex. 18 or 19, as a single sample; but it is not always desirable to do so. A more precise comparison is obtainable by this method only if the corresponding values of the two series are positively correlated, and only if they are correlated to a sufficient extent to counterbalance the loss of precision due to basing our estimate of variance upon fewer degrees of freedom. An example will make this plain.

Ex. 21. Significance of change in bacterial numbers. The following table shows the mean number of bacterial colonies per plate obtained by four slightly different methods from soil samples taken at 4 p.m. and 8 p.m. respectively (H. G. Thornton's data):

TABLE 28

Method    4 P.M.    8 P.M.    Difference
A         29.75     39.20       +9.45
B         27.50     40.60      +13.10
C         30.25     36.20       +5.95
D         27.80     42.40      +14.60
Mean      28.825    39.60      +10.775

From the series of differences we have $\bar{x} = +10.775$, $\tfrac{1}{4}s^2 = 3.756$, $t = 5.560$, $n = 3$, whence the table shows that P is between 0.01 and 0.02. If, on the contrary, we use the method of Ex. 20, and treat the two separate series, we find $\bar{x} - \bar{x}' = +10.775$, $\tfrac{1}{2}s^2 = 2.188$, $t = 7.285$, $n = 6$; this is not only a larger value of n but a larger value of t, which is now far beyond the range of the table, showing that P is extremely small. In this case the differential effects of the different methods are either negligible, or have acted quite differently in the two series, so that precision was lost in comparing each value with its counterpart in the other series. In cases like this it sometimes occurs that one method shows no significant difference, while the other brings it out; if either method indicates a definitely significant difference, its testimony cannot be ignored, even if the other method fails to show the effect. When no correspondence exists between the members of one series and those of the other, the second method only is available.
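The two treatments of Thornton's figures may be set side by side as follows. This is a minimal sketch, assuming Python with only the standard library; the function names are illustrative, and the data are those of Table 28.

```python
import math

four_pm = [29.75, 27.50, 30.25, 27.80]
eight_pm = [39.20, 40.60, 36.20, 42.40]

def paired_t(first, second):
    """Test the differences between corresponding values as a single sample."""
    diffs = [b - a for a, b in zip(first, second)]
    n_obs = len(diffs)
    mean = sum(diffs) / n_obs
    sum_sq = sum((d - mean) ** 2 for d in diffs)
    return mean / math.sqrt(sum_sq / (n_obs * (n_obs - 1))), n_obs - 1

def unpaired_t(first, second):
    """Treat the two series as separate samples with a pooled variance."""
    mean_1 = sum(first) / len(first)
    mean_2 = sum(second) / len(second)
    sum_sq = (sum((v - mean_1) ** 2 for v in first)
              + sum((v - mean_2) ** 2 for v in second))
    dof = len(first) + len(second) - 2
    variance_of_difference = (sum_sq / dof) * (1 / len(first) + 1 / len(second))
    return (mean_2 - mean_1) / math.sqrt(variance_of_difference), dof

t_paired, dof_paired = paired_t(four_pm, eight_pm)
t_unpaired, dof_unpaired = unpaired_t(four_pm, eight_pm)

print(f"differences:     t = {t_paired:.3f}, n = {dof_paired}")
print(f"separate series: t = {t_unpaired:.3f}, n = {dof_unpaired}")
```

Here the pairing brings no gain: the separate-series treatment yields both the larger t and the larger number of degrees of freedom.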

25.  Regression  Coefficients 

The methods of this chapter are applicable not only to mean values, in the strict sense of the word, but to the very wide class of statistics known as regression coefficients. The idea of regression is usually introduced in connection with the theory of correlation, but it is in reality a more general, and, in some respects, a simpler idea, and the regression coefficients are of interest and scientific importance in many classes of data where the correlation coefficient, if used at all, is an artificial concept of no real utility. The following qualitative examples are intended to familiarise the student with the concept of regression, and to prepare the way for the accurate treatment of numerical examples.

It is a commonplace that the height of a child depends on his age, although, knowing his age, we cannot accurately calculate his height. At each age the heights are scattered over a considerable range in a frequency distribution characteristic of that age; any feature of this distribution, such as the mean, will be a continuous function of age. The function which represents the mean height at any age is termed the regression function of height on age; it is represented graphically by a regression curve, or regression line. In relation to such a regression line age is termed the independent variate, and height the dependent variate.

The two variates bear very different relations to the regression line. If errors occur in the heights, this will not influence the regression of height on age, provided that at all ages positive and negative errors are equally frequent, so that they balance in the averages. On the contrary, errors in age will in general alter the regression of height on age, so that from a record with ages subject to error, or classified in broad age-groups, we should not obtain the true physical relationship between mean height and age. A second difference should be noted: the regression function does not depend on the frequency distribution of the independent variate, so that a true regression line may be obtained even when the age groups are arbitrarily selected, as when an investigation deals with children of "school age". On the other hand a selection of the dependent variate will change the regression line altogether.

It is clear from the above instances that the regression of height on age is quite different from the regression of age on height; and that one may have a definite physical meaning in cases in which the other has only the conventional meaning given to it by mathematical definition. In certain cases both regressions are of equal standing; thus, if we express in terms of the height of the father the average adult height of sons of fathers of a given height, observation shows that each additional inch of the fathers' height corresponds to about half an inch in the mean height of the sons. Equally, if we take the mean height of the fathers of sons of a given height, we find that each additional inch of the sons' height corresponds to half an inch in the mean height of the fathers. No selection has been exercised in the heights either of fathers or of sons; each variate is distributed normally, and the aggregate of pairs of values forms a normal correlation surface. Both regression lines are straight, and it is consequently possible to express the facts of regression in the simple rules stated above.

When  the  regression  line  with  which  we  are  con- 
cerned is  straight,  or,  in  other  words,  when  the  regres- 
sion function  is  linear,  the  specification  of  regression 
is  much  simplified,  for  in  addition  to  the  general  means 
we  have  only  to  state  the  ratio  which  the  increment  of 
the  mean  of  the  dependent  variate  bears  to  the  corre- 
sponding increment  of  the  independent  variate.  Such 
ratios  are  termed  regression  coefficients.  The  regres- 
sion function  takes  the  form 

$$Y = a + b(x - \bar{x}),$$

where b is the regression coefficient of y on x, and Y is the mean
value of y for each value of x. The physical dimensions of the
regression coefficient depend on those of the variates; thus, over an
age range in which growth is uniform we might express the regression
of height on age in inches per annum, in fact as an average growth
rate, while the regression of father's height on son's height is half
an inch per inch, or simply 1/2. Regression coefficients may, of
course, be positive or negative.

Curved regression lines are of common occurrence; in such cases we
may have to use such a regression function as

$$Y = a + bx + cx^2 + dx^3,$$

in  which  all  four  coefficients  of  the  regression  function 
may,  by  an  extended  use  of  the  term,  be  called  regres- 
sion coefficients.  More  elaborate  functions  of  x  may 
be  used,  but  their  practical  employment  offers  diffi- 
culties in  cases  where  we  lack  theoretical  guidance  in 
choosing  the  form  of  the  regression  function,  and  at 
present  the  simple  power  series  (or,  polynomial  in  x) 
is  alone  in  frequent  use.  By  far  the  most  important 
case  in  statistical  practice  is  the  straight  regression 
line. 

26.  Sampling  Errors  of  Regression  Coefficients 

The  straight  regression  line  with  formula 

$$Y = a + b(x - \bar{x})$$

is fitted by calculating from the data the two statistics

$$a = \bar{y}, \qquad b = \frac{S\{y(x - \bar{x})\}}{S\{(x - \bar{x})^2\}};$$

these are estimates, derived from the data, of the two constants
necessary to specify the straight line; the true regression formula,
which we should obtain from an infinity of observations, may be
represented by

$$Y = \alpha + \beta(x - \bar{x}),$$

and the differences $a - \alpha$, $b - \beta$ are the errors of random
sampling of our statistics. If $\sigma^2$ represent the variance of y
for any value of x about a mean given by the above formula, then, if
the values of x do not change from sample to sample, the variance of
a, the mean of n' observations, will be $\sigma^2/n'$, while that of b,
which is merely a weighted mean of the values of y observed, will be

$$\frac{\sigma^2\,S(x - \bar{x})^2}{\{S(x - \bar{x})^2\}^2} = \frac{\sigma^2}{S(x - \bar{x})^2}.$$

It  will  be  noticed  that  the  above  value  for  the 
sampling  variance  of  a  is  not  merely  the  sampling 
variance  of  our  estimate  of  the  mean  of  y,  but  of  our 
estimate  of  the  mean  of  y  for  a  given  value  of  x,  this 
value  being  chosen  at,  or  near  to,  the  mean  of  our 
sample,  and  supposed  invariable  from  sample  to 
sample.  The  distinction,  which  at  first  sight  appears 
somewhat  subtle,  is  worth  bearing  in  mind.  From  a 
set  of  measurements  of  school  children  we  may  make 
estimates  of  the  mean  stature  at  age  ten,  and  of  the 
mean  stature  of  the  school,  and  these  estimates  will 
be  equal  if  the  mean  age  of  the  school  children  is 
exactly  ten.  Nevertheless,  the  former  will  usually  be 
the  more  accurate  estimate,  for  it  eliminates  the  varia- 
tion in  mean  school  age,  which  will  doubtless  contribute 
somewhat  to  the  variation  in  mean  school  stature. 

In order to test the significance of the difference between b and any
hypothetical value, $\beta$, to which it is to be compared, we must
estimate the value of $\sigma^2$; the best estimate for the purpose is

$$s^2 = \frac{1}{n' - 2}\,S(y - Y)^2,$$

found by summing the squares of the deviations of y from its
calculated value Y, and dividing by (n' - 2). The reason the divisor
is (n' - 2) is that from the n' values of y two statistics have
already been calculated which enter into the formula for Y;
consequently the group of differences, y - Y, represent in reality
only n' - 2 degrees of freedom.

When n' is small, the estimate of $\sigma^2$ obtained above is
somewhat uncertain, and in comparing the difference $b - \beta$ with
its standard error, in order to test its significance we shall have
to use Student's method, with n = n' - 2. When n' is large this
distribution tends to the normal distribution. The value of t with
which the table must be entered is

$$t = \frac{(b - \beta)\sqrt{S(x - \bar{x})^2}}{s}.$$

Similarly, to test the significance of the difference between a and
any hypothetical value $\alpha$, the table is entered with

$$t = \frac{(a - \alpha)\sqrt{n'}}{s}, \qquad n = n' - 2;$$

this  test  for  the  significance  of  a  will  be  more  sensitive 
than  that  ignoring  the  regression,  if  the  variation  in 
y  is  to  any  considerable  extent  expressible  in  terms 
of that of x, for the value of s obtained from the regression
line will then be smaller than that obtained from
the  original  group  of  observations.  On  the  other  hand, 
one  degree  of  freedom  is  always  lost,  so  that  if  b  is 
small,  no  greater  precision  is  obtained. 

Ex.  22.  Effect  of  nitrogenous  fertilisers  in  main- 
taining yield. — The  yields  of  dressed  grain  in  bushels 
per  acre  shown  in  Table  29  were  obtained  from  two 
plots  on  Broadbalk  wheat  field  during  thirty  years  ;  the 
only difference in manurial treatment was that "9a"
received nitrate of soda, while "7b" received an equivalent
quantity of nitrogen as sulphate of ammonia.
In the course of the experiment plot "9a" appears to
be gaining in yield on plot "7b". Is this apparent
gain  significant  ? 

A  great  part  of  the  variation  in  yield  from  year  to 
year  is  evidently  similar  in  the  two  plots  ;  in  conse- 
quence, the  series  of  differences  will  give  the  clearer 
result.  In  one  respect  the  above  data  are  especially 
simple,  for  the  thirty  values  of  the  independent  variate 
form  a  series  with  equal  intervals  between  the  successive 
values,  with  only  one  value  of  the  dependent  variate 
corresponding  to  each.  In  such  cases  the  work  is 
simplified  by  using  the  formula 

$$S(x - \bar{x})^2 = \tfrac{1}{12}\,n'(n'^2 - 1),$$

where n' is the number of terms, or 30 in this case. To evaluate b it
is necessary to calculate

$$S\{y(x - \bar{x})\};$$

this may be done in several ways. We may multiply the successive
values of y by -29, -27, ..., +27, +29, add, and divide by 2. This is
the direct method suggested by the formula. The same result is
obtained by multiplying by 1, 2, ..., 30 and subtracting 15½ (that
is, (n' + 1)/2) times the sum of values of y; the latter method may
be conveniently carried out by successive addition. Starting from the
bottom of the column, the successive sums 2.69, 9.76, 6.82, ... are
written down, each being found by adding a new value of y to the
total already accumulated; the sum of the new column, less 15½ times
the sum of the previous column, will be the value required. In this
case we find the value 599.615, and dividing by 2247.5, the value of
b is found to be 0.2668. The yield of plot "9a" thus appears to have
gained on that of "7b" at a rate somewhat over a quarter of a bushel
per annum.

TABLE 29

Harvest Year.      9a.      7b.     9a - 7b.

1855             29.62    33.00      -3.38
1856             32.38    36.91      -4.53
1857             43.75    44.84      -1.09
1858             37.56    38.94      -1.38
1859             30.00    34.66      -4.66
1860             32.62    27.72      +4.90
1861             33.75    34.94      -1.19
1862             43.44    35.88      +7.56
1863             55.56    53.66      +1.90
1864             51.06    45.78      +5.28
1865             44.06    40.22      +3.84
1866             32.50    29.91      +2.59
1867             29.13    22.16      +6.97
1868             47.81    39.19      +8.62
1869             39.00    28.25     +10.75
1870             45.50    41.37      +4.13
1871             34.44    22.31     +12.13
1872             40.69    29.06     +11.63
1873             35.81    22.75     +13.06
1874             38.19    39.56      -1.37
1875             30.50    26.63      +3.87
1876             33.31    25.50      +7.81
1877             40.12    19.12     +21.00
1878             37.19    32.19      +5.00
1879             21.94    17.25      +4.69
1880             34.06    34.31      -0.25
1881             35.44    26.13      +9.31
1882             31.81    34.75      -2.94
1883             43.38    36.31      +7.07
1884             40.44    37.75      +2.69

Mean             37.50    33.03      +4.47

Scheme of calculation printed with Table 29:

S(x - \bar{x})^2 = n'(n'^2 - 1)/12     = 2247.5
b                                      = 0.2668
S(y - \bar{y})^2                       = 1020.56
b^2 S(x - \bar{x})^2                   = 159.99
S(y - Y)^2                             = 860.57
s^2                                    = 30.73
s^2 / S(x - \bar{x})^2                 = 0.013675 = (0.1169)^2
t                                      = 2.282
n                                      = 28

To estimate the standard error of b, we require the value of

$$S(y - Y)^2;$$

knowing the value of b, it is easy to calculate the thirty values of
Y from the formula

$$Y = \bar{y} + (x - \bar{x})b;$$

for the first value, $x - \bar{x} = -14.5$, and the remaining values
of Y may be found in succession by adding b each time. By subtracting
each value of Y from the corresponding y, squaring, and adding, the
required quantity may be calculated directly. This method is
laborious, and it is preferable in practice to utilise the
algebraical fact that

$$S(y - Y)^2 = S(y - \bar{y})^2 - b^2 S(x - \bar{x})^2 = S(y^2) - n'\bar{y}^2 - b^2 S(x - \bar{x})^2.$$

The work then consists in squaring the values of y and adding, then
subtracting the two quantities which can be directly calculated from
the mean value of y and the value of b. In using this shortened
method it should be noted that small errors in $\bar{y}$ and b may
introduce considerable errors in the result, so that it is necessary
to be sure that these are calculated accurately to as many
significant figures as are needed in the quantities to be subtracted.
Errors of arithmetic which would have little effect in the first
method may altogether vitiate the results if the second method is
used. The subsequent work in calculating the standard error of b may
best be followed in the scheme given with the table of data; the
estimated standard error is 0.1169, so that in testing the hypothesis
that $\beta = 0$, that is that plot "9a" has not been gaining on plot
"7b", we divide b by this quantity and find t = 2.282. Since s was
found from 28 degrees of freedom, n = 28, and the value of t shows
that P is between 0.02 and 0.05.
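
The arithmetic of Ex. 22 can be retraced in a few lines. The sketch below is a modern illustration in Python and no part of the original computation; the variable names are ours, and x is assumed to run 1, 2, ..., 30 over the harvest years 1855 to 1884. It evaluates S{y(x - x̄)} both directly and by summation from the bottom of the column, then forms b, the residual sum of squares by the shortened method, s², the standard error of b, and t.

```python
# Differences "9a - 7b" from Table 29, harvest years 1855 to 1884.
y = [-3.38, -4.53, -1.09, -1.38, -4.66, 4.90, -1.19, 7.56, 1.90, 5.28,
     3.84, 2.59, 6.97, 8.62, 10.75, 4.13, 12.13, 11.63, 13.06, -1.37,
     3.87, 7.81, 21.00, 5.00, 4.69, -0.25, 9.31, -2.94, 7.07, 2.69]

n_obs = len(y)                       # n' = 30
x_bar = (n_obs + 1) / 2              # mean of x = 1, 2, ..., 30
s_xx = n_obs * (n_obs**2 - 1) / 12   # S(x - x_bar)^2

# Direct method: multiply each y by its deviation (x - x_bar) and add.
s_xy_direct = sum(v * (i - x_bar) for i, v in enumerate(y, start=1))

# Summation method: running totals from the bottom of the column; their sum
# equals S(x * y), from which (n' + 1)/2 times S(y) is subtracted.
running, totals = 0.0, []
for v in reversed(y):
    running += v
    totals.append(running)
s_xy_summed = sum(totals) - x_bar * sum(y)

b = s_xy_direct / s_xx
y_bar = sum(y) / n_obs

# Shortened method: S(y - Y)^2 = S(y^2) - n' * y_bar^2 - b^2 * S(x - x_bar)^2
resid_ss = sum(v * v for v in y) - n_obs * y_bar**2 - b**2 * s_xx
dof = n_obs - 2                      # n = n' - 2
s2 = resid_ss / dof
se_b = (s2 / s_xx) ** 0.5
t = b / se_b

print(f"S y(x - x_bar): {s_xy_direct:.3f} (direct), {s_xy_summed:.3f} (summation)")
print(f"b = {b:.4f}, s^2 = {s2:.2f}, s.e.(b) = {se_b:.4f}, t = {t:.3f}, n = {dof}")
```

Because b is carried here without rounding, the last figure of some intermediate quantities may differ slightly from the printed scheme, which works from b = 0.2668.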

The  result  must  be  judged  significant,  though 
barely  so  ;  in  view  of  the  data  we  cannot  ignore  the 
possibility  that  on  this  field,  and  in  conjunction  with 
the  other  manures  used,  nitrate  of  soda  has  conserved 
the  fertility  better  than  sulphate  of  ammonia  ;  these 
data  do  not,  however,  demonstrate  the  point  beyond 
possibility  of  doubt. 

The standard error of $\bar{y}$, calculated from the above data, is
1.012, so that there can be no doubt that the difference in mean
yields is significant; if we had tested the significance of the mean,
without regard to the order of the values, that is calculating $s^2$
by dividing 1020.56 by 29, the standard error would have been 1.083.
The value of b was therefore high enough to have reduced the standard
error. This suggests the possibility that if we had fitted a more
complex regression line to the data the probable errors would be
further reduced to an extent which would put the significance of b
beyond doubt. We shall deal later with the fitting of curved
regression lines to this type of data.
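
The two standard errors of the mean quoted above follow directly from the sums of squares in the scheme for Ex. 22. The short check below is only an illustration; the variable names are ours, and the two sums of squares are copied from the printed scheme rather than recomputed.

```python
from math import sqrt

n_obs = 30                 # n', the number of yearly differences
resid_ss_line = 860.57     # S(y - Y)^2 about the fitted straight line
total_ss_mean = 1020.56    # S(y - y_bar)^2 about the mean alone

# With the regression fitted, s^2 rests on n' - 2 degrees of freedom.
se_mean_with_line = sqrt(resid_ss_line / (n_obs - 2) / n_obs)

# Ignoring the order of the values, s^2 rests on n' - 1 degrees of freedom.
se_mean_ignoring_order = sqrt(total_ss_mean / (n_obs - 1) / n_obs)

print(round(se_mean_with_line, 3), round(se_mean_ignoring_order, 3))
```

The loss of one further degree of freedom is here more than repaid by the smaller residual sum of squares.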

Just  as  the  method  of  comparison  of  means  is 
applicable  when  the  samples  are  of  different  sizes,  by 
obtaining  an  estimate  of  the  error  by  combining  the 
sums  of  squares  obtained  from  the  two  different 
samples,  so  we  may  compare  regression  coefficients 
when  the  series  of  values  of  the  independent  variate 
are  not  identical  ;  or  if  they  are  identical  we  can  ignore 
the  fact  in  comparing  the  regression  coefficients. 

Ex.  23.  Comparison  of  relative  growth  rate  of 
two  cultures  of  an  alga. — Table  30  shows  the  logarithm 
(to  the  base  10)  of  the  volumes  occupied  by  algal  cells 
on  successive  days,  in  parallel  cultures,  each  taken 
over  a  period  during  which  the  relative  growth  rate 
was  approximately  constant.  In  culture  A  nine 
values are available, and in culture B eight (Dr.
M. Bristol-Roach's data).

The method of finding $S\,y(x - \bar{x})$ by summation is shown in the
second pair of columns: the original values are added up from the
bottom, giving successive totals from 6.087 to 43.426; the final
value should, of course, tally with the total below the original
values. From the sum of the column of totals is subtracted the sum of
the original values multiplied by 5 for A and by 4½ for B. The
differences are $S\,y(x - \bar{x})$; these must be divided by the
respective values of $S(x - \bar{x})^2$, namely, 60 and 42, to give
the values of b, measuring the relative growth rates of the two
cultures. To test
if the difference is significant we calculate in the two cases
$S(y^2)$, and subtract successively the product of the mean with the
total, and the product of b with $S\,y(x - \bar{x})$; this process
leaves the two values of $S(y - Y)^2$, which are added as shown in the
table, and the sum divided by n, to give $s^2$. The value of n is
found by adding the 7 degrees of freedom from series A to the 6
degrees from series B, and is therefore 13. Estimates of the variance
of the two regression coefficients are obtained by dividing $s^2$ by
60 and 42, and that of the variance of their difference is the sum of
these. Taking the square root we find the standard error to be
0.01985, and t = 1.844. The difference between the regression
coefficients, though relatively large, cannot be regarded as
significant. There is not sufficient evidence to assert that culture
B was growing more rapidly than culture A.

TABLE 30

                       Log Values.          Summation Values.
                       A.        B.         A.         B.

                       3.592     3.538      43.426     38.358
                       3.823     3.828      39.834     34.820
                       4.174     4.349      36.011     30.992
                       4.534     4.833      31.837     26.643
                       4.956     4.911      27.303     21.810
                       5.163     5.297      22.347     16.899
                       5.495     5.566      17.184     11.602
                       5.602     6.036      11.689      6.036
                       6.087                 6.087

Total                 43.426    38.358     235.718    187.160
Mean                   4.8251    4.7947
5 (A) or 4½ (B) times the total            217.130    172.611
S y(x - \bar{x})                            18.588     14.549
b                                            0.3098     0.3464

Scheme of calculation printed with Table 30:

S(y - Y)^2, A         0.05089
            B         0.07563
n s^2                 0.12652
s^2                   0.009732
s^2/60                0.0001622
s^2/42                0.0002317
Sum                   0.0003939
Standard error        0.01985
b' - b                0.0366
t                     1.844
n                     13
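
The comparison of Ex. 23 may be sketched in Python as follows. This is an illustrative reconstruction, not the original working: the helper `line_summary` and the variable names are ours, and the days are assumed to be numbered 1, 2, ... within each culture, which is all the calculation requires.

```python
def line_summary(y):
    """For x = 1, 2, ..., n' return (b, S(x - x_bar)^2, S(y - Y)^2)."""
    n_obs = len(y)
    x_bar = (n_obs + 1) / 2
    s_xx = n_obs * (n_obs**2 - 1) / 12
    s_xy = sum(v * (i - x_bar) for i, v in enumerate(y, start=1))
    b = s_xy / s_xx
    y_bar = sum(y) / n_obs
    # S(y^2) less the product of mean and total, less the product of b and S y(x - x_bar)
    resid_ss = sum(v * v for v in y) - n_obs * y_bar**2 - b * s_xy
    return b, s_xx, resid_ss

# Log volumes from Table 30.
culture_a = [3.592, 3.823, 4.174, 4.534, 4.956, 5.163, 5.495, 5.602, 6.087]
culture_b = [3.538, 3.828, 4.349, 4.833, 4.911, 5.297, 5.566, 6.036]

b_a, sxx_a, rss_a = line_summary(culture_a)
b_b, sxx_b, rss_b = line_summary(culture_b)

dof = (len(culture_a) - 2) + (len(culture_b) - 2)   # 7 + 6 degrees of freedom
s2 = (rss_a + rss_b) / dof                          # pooled estimate of variance
se_diff = (s2 / sxx_a + s2 / sxx_b) ** 0.5          # s.e. of the difference of slopes
t = (b_b - b_a) / se_diff

print(f"b(A) = {b_a:.4f}, b(B) = {b_b:.4f}, s.e. = {se_diff:.5f}, t = {t:.3f}, n = {dof}")
```

Pooling the two residual sums of squares assumes, as the text does, that the variance about the regression line is the same in both cultures.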

27. The Fitting of Curved Regression Lines

Little  progress  has  been  made  with  the  theory  of 
the  fitting  of  curved  regression  lines,  save  in  the 
limited but most important case when the variability
of the dependent variate is the same for all values
of the independent variate, and is normal for each such
value. When this is the case a technique has been
fully  worked  out  for  fitting  by  successive  stages  any 
line  of  the  form 

$$Y = a + bx + cx^2 + dx^3 + \ldots;$$

we  shall  give  details  of  the  case  where  the  successive 
values  of  x  are  at  equal  intervals. 

As  it  stands  the  above  form  would  be  inconvenient 
in  practice,  in  that  the  fitting  could  not  be  carried 
through  in  successive  stages.  What  is  required  is  to 
obtain  successively  the  mean  of  y,  an  equation  linear 
in  x,  an  equation  quadratic  in  x,  and  so  on,  each 
equation  being  obtained  from  the  last  by  adding,  a  new 
term  being  calculated  by  carrying  a  single  process  of 
computation  through  a  new  stage.  In  order  to  do 
this  we  take 

$$Y = A + B\xi_1 + C\xi_2 + D\xi_3 + \ldots,$$

where $\xi_1$, $\xi_2$, $\xi_3$ shall be functions of x of the 1st,
2nd, and 3rd degrees, out of which the regression formula may be
built. It may be shown that the functions required for this purpose
may be expressed in terms of the moments of the x distribution, as
follows:

$$\xi_1 = x - \bar{x},$$

$$\xi_2 = \xi_1^2 - \mu_2 = \xi_1^2 - \frac{n'^2 - 1}{12},$$

$$\xi_3 = \xi_1^3 - \frac{\mu_4}{\mu_2}\,\xi_1 = \xi_1^3 - \frac{3n'^2 - 7}{20}\,\xi_1,$$

$$\xi_4 = \xi_1^4 - \frac{3n'^2 - 13}{14}\,\xi_1^2 + \frac{3(n'^2 - 1)(n'^2 - 9)}{560},$$

$$\xi_5 = \xi_1^5 - \frac{\mu_2\mu_8 - \mu_4\mu_6}{\mu_2\mu_6 - \mu_4^2}\,\xi_1^3 + \frac{\mu_4\mu_8 - \mu_6^2}{\mu_2\mu_6 - \mu_4^2}\,\xi_1
       = \xi_1^5 - \frac{5(n'^2 - 7)}{18}\,\xi_1^3 + \frac{15n'^4 - 230n'^2 + 407}{1008}\,\xi_1,$$

where the values of the moment functions have been expressed in terms
of n', the number of observations, as far as is needed for fitting
curves up to the 5th degree. The values of x are taken to increase by
unity.

Algebraically the process of fitting may now be represented by the
equations

$$A = \frac{1}{n'}\,S(y),$$

$$B = \frac{12}{n'(n'^2 - 1)}\,S(y\xi_1),$$

$$C = \frac{180}{n'(n'^2 - 1)(n'^2 - 4)}\,S(y\xi_2),$$

and, in general, the coefficient of the term of the rth degree is

$$\frac{(2r)!\,(2r + 1)!}{(r!)^4\,n'(n'^2 - 1)\cdots(n'^2 - r^2)}\,S(y\xi_r).$$

As each term is fitted the regression line approaches more nearly to
the observed values, and the sum of the squares of the deviations

$$S(y - Y)^2$$

is diminished. It is desirable to be able to calculate this quantity
without evaluating the actual values of Y at each point of the
series; this can be done by subtracting from $S(y^2)$ the successive
quantities

$$n'A^2, \qquad \frac{n'(n'^2 - 1)}{12}\,B^2, \qquad \frac{n'(n'^2 - 1)(n'^2 - 4)}{180}\,C^2, \qquad \ldots$$

or more simply

$$A\,S(y), \qquad B\,S(y\xi_1), \qquad C\,S(y\xi_2),$$

and so on. These quantities represent the reduction which the sum of
the squares of the residuals suffers each time the regression curve
is fitted to a higher degree; and enable its value to be calculated
at any stage by a mere extension of the process already used in the
preceding examples. To obtain an estimate, $s^2$, of the residual
variance, we divide by n, the number of degrees of freedom left after
fitting, which is found from n' by subtracting from it the number of
constants in the regression formula. Thus, if a straight line has
been fitted, n = n' - 2; while if a curve of the fifth degree has
been fitted, n = n' - 6.
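
The properties on which this stage-by-stage fitting rests can be checked numerically. The sketch below is an illustration only, with function names of our own choosing; it builds the first three functions from the expressions in n' given in this section, for n' = 30 and x increasing by unity, and uses exact fractions so that the checks are free from rounding. The sum of squares is taken as the reciprocal of the general multiplier of S(yξ_r) stated above.

```python
from fractions import Fraction
from math import factorial

def xi_polynomials(n_obs):
    """xi_1, xi_2, xi_3 evaluated at x = 1, ..., n' (x increasing by unity)."""
    x_bar = Fraction(n_obs + 1, 2)
    xi1 = [Fraction(x) - x_bar for x in range(1, n_obs + 1)]
    xi2 = [u**2 - Fraction(n_obs**2 - 1, 12) for u in xi1]
    xi3 = [u**3 - Fraction(3 * n_obs**2 - 7, 20) * u for u in xi1]
    return xi1, xi2, xi3

def sum_of_squares(n_obs, r):
    """S(xi_r^2) = (r!)^4 n'(n'^2 - 1)...(n'^2 - r^2) / ((2r)! (2r + 1)!)."""
    product = n_obs
    for k in range(1, r + 1):
        product *= n_obs**2 - k**2
    return Fraction(factorial(r) ** 4 * product,
                    factorial(2 * r) * factorial(2 * r + 1))

def dot(u, v):
    """Sum of products of two equally long series."""
    return sum(p * q for p, q in zip(u, v))

xi1, xi2, xi3 = xi_polynomials(30)

# Each function sums to zero and is uncorrelated with the others, which is
# why a new term can be added without disturbing those already fitted.
print(sum(xi2), sum(xi3), dot(xi1, xi2), dot(xi1, xi3), dot(xi2, xi3))
print(dot(xi1, xi1), sum_of_squares(30, 1))
```

Since the cross-products vanish, each coefficient is simply S(yξ_r) divided by S(ξ_r²), and each reduces the residual sum of squares independently of the rest.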

28.  The  Arithmetical  Procedure  of  Fitting 

The  main  arithmetical  labour  of  fitting  curved 
regression  lines  to  data  of  this  type  may  be  reduced  to 
a  repetition  of  the  process  of  summation  illustrated  in 
Ex.  23.  We  shall  assume  that  the  values  of  y  are 
written  down  in  a  column  in  order  of  increasing  values 
of x, and that at each stage the summation is commenced at the top of
the column (not at the bottom, as in that example). The sums of the
successive columns will be denoted by $S_1$, $S_2$, ... When these
values have been obtained, each is divided by an appropriate divisor,
which depends only on n', giving us a new series of quantities a, b,
c, ... according to the following equations

$$a = \frac{1}{n'}\,S_1 = \frac{1}{n'}\,S(y) = \bar{y},$$

$$b = \frac{1 \cdot 2}{n'(n' + 1)}\,S_2,$$

$$c = \frac{1 \cdot 2 \cdot 3}{n'(n' + 1)(n' + 2)}\,S_3,$$

and so on.

From these a third series of quantities a', b', c', ... are obtained
by equations independent of n', of which we give below the first six,
which are enough to carry the process of fitting up to the 5th
degree:

$$a' = a,$$
$$b' = a - b,$$
$$c' = a - 3b + 2c,$$
$$d' = a - 6b + 10c - 5d,$$
$$e' = a - 10b + 30c - 35d + 14e,$$
$$f' = a - 15b + 70c - 140d + 126e - 42f.$$

The rule for the formation of the coefficients is to multiply
successively by

$$\frac{r(r + 1)}{1 \cdot 2}, \qquad \frac{(r - 1)(r + 2)}{2 \cdot 3}, \qquad \frac{(r - 2)(r + 3)}{3 \cdot 4},$$

and so on till the series terminates.

These new quantities are proportional to the required coefficients of
the regression equation, and need only be divided by a second group
of divisors to give the actual values. The equations are

$$A = a', \qquad B = \frac{6}{n' - 1}\,b',$$

$$C = \frac{30}{(n' - 1)(n' - 2)}\,c', \qquad D = \frac{140}{(n' - 1)(n' - 2)(n' - 3)}\,d',$$

$$E = \frac{630}{(n' - 1)(n' - 2)\cdots(n' - 4)}\,e', \qquad F = \frac{2772}{(n' - 1)\cdots(n' - 5)}\,f',$$

the numerical part of the factor being

$$\frac{(2r + 1)!}{(r!)^2}$$

for the term of degree r.

If an equation of degree r has been fitted, the estimates of the
standard errors of the coefficients are all based upon the same value
of $s^2$, i.e.

$$s^2 = \frac{1}{n' - r - 1}\left\{S(y^2) - n'A^2 - \frac{n'(n'^2 - 1)}{12}\,B^2 - \ldots\right\},$$

from which the estimated standard error of any coefficient, such as
that of $\xi_p$, is obtained by dividing by

$$S(\xi_p^2) = \frac{(p!)^4}{(2p)!\,(2p + 1)!}\,n'(n'^2 - 1)\cdots(n'^2 - p^2)$$

and taking out the square root. The number of degrees of freedom upon
which the estimate is based is (n' - r - 1), and this must be equated
to n in using the table of t.

A  suitable  example  for  using  this  method  may  be 
obtained  by  fitting  the  values  of  Ex.  22  (p.  118)  with 
a  curve  of  the  second  or  third  degree. 
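
As a sketch of the example just suggested, the summation procedure may be carried through for the differences of Ex. 22 up to the third degree. The code is an illustration in Python with names of our own; the divisor used for d, namely 1·2·3·4/{n'(n' + 1)(n' + 2)(n' + 3)}, is assumed by continuing the pattern of the equations for a, b and c.

```python
from itertools import accumulate

# Differences "9a - 7b" from Table 29, in order of increasing x.
y = [-3.38, -4.53, -1.09, -1.38, -4.66, 4.90, -1.19, 7.56, 1.90, 5.28,
     3.84, 2.59, 6.97, 8.62, 10.75, 4.13, 12.13, 11.63, 13.06, -1.37,
     3.87, 7.81, 21.00, 5.00, 4.69, -0.25, 9.31, -2.94, 7.07, 2.69]
n_obs = len(y)

# Successive summation from the top: S_1 is the total of y, S_2 the total of
# the column of running sums, S_3 the total of the running sums of that
# column, and so on.
column, col_sums = y, []
for _ in range(4):
    col_sums.append(sum(column))
    column = list(accumulate(column))
S1, S2, S3, S4 = col_sums

# First divisors, depending only on n'.
a = S1 / n_obs
b = 2 * S2 / (n_obs * (n_obs + 1))
c = 6 * S3 / (n_obs * (n_obs + 1) * (n_obs + 2))
d = 24 * S4 / (n_obs * (n_obs + 1) * (n_obs + 2) * (n_obs + 3))

# Quantities a', b', c', d', formed by equations independent of n'.
a_p = a
b_p = a - b
c_p = a - 3 * b + 2 * c
d_p = a - 6 * b + 10 * c - 5 * d

# Second divisors give the coefficients of xi_1, xi_2, xi_3.
A = a_p
B = 6 * b_p / (n_obs - 1)
C = 30 * c_p / ((n_obs - 1) * (n_obs - 2))
D = 140 * d_p / ((n_obs - 1) * (n_obs - 2) * (n_obs - 3))

print(f"S1..S4 = {S1:.2f}, {S2:.2f}, {S3:.2f}, {S4:.2f}")
print(f"a', b', c', d' = {a_p:.6f}, {b_p:.6f}, {c_p:.6f}, {d_p:.6f}")
print(f"A, B, C, D = {A:.4f}, {B:.4f}, {C:.5f}, {D:.6f}")
```

The value of B found in this way is the regression coefficient b of Ex. 22, which gives a convenient check on the early stages of the work.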

28.1. The Calculation of the Polynomial Values

The  methods  of  the  preceding  sections  provide  an 
analysis  of  a  series  into  the  components  which  can  be 
represented  by  polynomial  terms  of  any  required 
degree,  and  the  remainder  which  cannot  be  so  repre- 
sented. For  much  work  of  this  kind  it  is  desirable  to 
carry  out  this  analysis  without  the  labour  of  calculating 
the  polynomial  values,  Y,  at  each  point  of  the  series. 
Sometimes,  however,  it  is  desirable  to  have  these 
values,  either  to  construct  a  graph,  to  examine  the 
deviations  in  regions  of  special  interest,  or  because 
doing  so  provides  a  completely  satisfactory  check 
upon  the  results  calculated. 

The very tedious procedure of calculating the individual values of
$\xi$, and from them, and the calculated coefficients, forming the
individual values of the polynomial, may be avoided by building up
the whole series, by a continuous process, from its differences. The
process is obvious when a straight line is fitted. For the terminal
value, and the constant difference between successive values, we take

$$Y_1 = a' + 3b', \qquad \Delta Y_1 = -\frac{6}{n' - 1}\,b',$$

and build up all the other values of Y by continuous addition of the
constant difference. The method is, however, applicable to
polynomials of high order, and in such cases appears to save more
than three-quarters of the labour of calculation. For curves of the
second degree the equations are:

$$Y_1 = a' + 3b' + 5c',$$

$$\Delta Y_1 = -\frac{6}{n' - 1}\,(b' + 5c'),$$

$$\Delta^2 Y_1 = \frac{60}{(n' - 1)(n' - 2)}\,c'.$$

Starting with the terminal value $\Delta Y_1$, the series of first
differences is built up by successive addition of the constant second
difference $\Delta^2 Y_1$; then starting from $Y_1$, and adding
successively the first differences, the series of values of Y is
built up in turn.

The formulae for any degree are constructed using the factors, with
alternate positive and negative signs,

$$1, \qquad \frac{-2 \cdot 3}{n' - 1}, \qquad \frac{3 \cdot 4 \cdot 5}{(n' - 1)(n' - 2)}, \qquad \frac{-4 \cdot 5 \cdot 6 \cdot 7}{(n' - 1)(n' - 2)(n' - 3)}, \qquad \ldots$$

together with expressions in a', b', c', ... with the same
coefficients whatever the degree of the curve.

The arithmetical procedure, which consists almost entirely of
successive addition, may be illustrated on the series of Ex. 22.
Table 30.1 shows on the left the last five lines of the summations
needed to fit a curve of the third degree, and on the right the first
five lines of the summations by which the polynomial values are built
up.

TABLE 30.1

             Observed    1st       2nd        3rd       |  Polynomial   1st     2nd       3rd
             Values.     Sum.      Sum.       Sum.      |  Values.      Diff.   Diff.     Diff.

              -0.25     117.88     960.77    4440.58    |   5.86       0.739   -0.1280
              +9.31     127.19    1087.96    5528.54    |   4.99       0.871   -0.1320
              -2.94     124.25    1212.21    6740.75    |   3.98       1.008   -0.1361
              +7.07     131.32    1343.53    8084.28    |   2.84       1.148   -0.1402
              +2.69     134.01    1477.54    9561.82    |   1.544      1.2919  -0.14423  0.004061

Total        134.01    1477.54    9561.82   39167.21    | 134.00
a, ..., d      4.467000   3.177505   1.927786   0.957165
a', ..., d'    4.467000   1.289495  -1.209947  -0.105995

Below the first four columns are shown the values of a, ..., d
derived directly from the totals, and of a', ..., d' derived from
them. If we want the values of Y to two decimal places, it will be as
well to calculate $Y_1$ to three places, and each difference to one
more place than the last, discarding one place for the subsequent
differences of each series. With this in view six decimal places will
be sufficient for a, ..., d. Any further degree of accuracy required
may be obtained merely by retaining additional digits. The sum of the
column of polynomial values, which must tally with that of those
observed, provides an excellent check of the latter parts of the
procedure, but not of the correctness of the initial summations.
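
The building up of the polynomial values by continuous addition may be sketched as follows. Two points are assumptions of ours rather than statements of the text: the expressions used for the third degree (a' + 3b' + 5c' + 7d' and its differences) were worked out from the definitions of ξ₁, ξ₂, ξ₃ and agree with the figures of Table 30.1, and Y₁ is taken to be the value at the last point of the series, the others following in reverse order of x. The function name is hypothetical.

```python
def polynomial_values(n_obs, a1, b1, c1, d1):
    """Build a cubic's values from its terminal value and leading differences.

    a1, b1, c1, d1 stand for a', b', c', d'. The returned list starts at the
    terminal value Y_1 and proceeds by successive addition of differences.
    """
    m1, m2, m3 = n_obs - 1, n_obs - 2, n_obs - 3
    value = a1 + 3 * b1 + 5 * c1 + 7 * d1          # Y_1
    diff1 = -6 / m1 * (b1 + 5 * c1 + 14 * d1)      # first difference at Y_1
    diff2 = 60 / (m1 * m2) * (c1 + 7 * d1)         # second difference at Y_1
    diff3 = -840 / (m1 * m2 * m3) * d1             # constant third difference
    values = []
    for _ in range(n_obs):
        values.append(value)
        value += diff1     # next polynomial value
        diff1 += diff2     # next first difference
        diff2 += diff3     # next second difference
    return values

# a', b', c', d' as printed below the first four columns of Table 30.1.
fitted = polynomial_values(30, 4.467000, 1.289495, -1.209947, -0.105995)

print([round(v, 2) for v in fitted[:5]])
print(round(sum(fitted), 2))
```

The total of the fitted values equals n'a' apart from rounding, which is the check on the later stages of the work mentioned above.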

The  coefficients  used  in  this  method  up  to  curves 
of  the  tenth  degree  are  : 

29. Regression with several Independent Variates

It  frequently  happens  that  the  data  enable  us  to 
express  the  average  value  of  the  dependent  variate  y, 
in  terms  of  a  number  of  different  independent  variates 
$x_1$, $x_2$, ..., $x_p$. For example, the rainfall at any
point  within  a  district  may  be  recorded  at  a  number 
of  stations  for  which  the  longitude,  latitude,  and  alti- 
tude are  all  known.  If  all  of  these  three  variates 
influence  the  rainfall,  it  may  be  required  to  ascertain 
the  average  effect  of  each  separately.  In  speaking 
of  longitude,  latitude,  and  altitude  as  independent 
variates,  all  that  is  implied  is  that  it  is  in  terms  of 
them  that  the  average  rainfall  is  to  be  expressed  ;  it 
is  not  implied  that  these  variates  vary  independently, 
in  the  sense  that  they  are  uncorrelated.  On  the  con- 
trary, it  may  well  happen  that  the  more  southerly 
stations  lie  on  the  whole  more  to  the  west  than  do  the 
more  northerly  stations,  so  that  for  the  stations  avail- 
able longitude  measured  to  the  west  may  be  nega- 
tively correlated  with  latitude  measured  to  the  north. 
If,  then,  rainfall  increased  to  the  west  but  was 
independent  of  latitude,  we  should  obtain,  merely  by 
comparing  the  rainfall  recorded  at  different  latitudes, 
a  fictitious  regression  indicating  a  falling  off  of  rain 
with  increasing  latitude.  What  we  require  is  an 
equation  taking  account  of  all  three  variates  at  each 
station,  and  agreeing  as  nearly  as  possible  with  the 
values  recorded  ;  this  is  called  a  partial  regression 
equation,  and  its  coefficients  are  known  as  partial 
regression  coefficients. 

To simplify the algebra we shall suppose that y, $x_1$, $x_2$, $x_3$
are all measured from their mean values, and that we are seeking a
formula of the form

$$Y = b_1 x_1 + b_2 x_2 + b_3 x_3.$$

If S stands for summation over all the sets of observations we
construct the three equations

$$b_1 S(x_1^2) + b_2 S(x_1 x_2) + b_3 S(x_1 x_3) = S(x_1 y),$$
$$b_1 S(x_1 x_2) + b_2 S(x_2^2) + b_3 S(x_2 x_3) = S(x_2 y),$$
$$b_1 S(x_1 x_3) + b_2 S(x_2 x_3) + b_3 S(x_3^2) = S(x_3 y),$$

of which the nine coefficients are obtained from the data either by
direct multiplication and addition, or, if the data are numerous, by
constructing correlation tables for each of the six pairs of
variates. The three simultaneous equations for $b_1$, $b_2$, and
$b_3$ are solved in the ordinary way; first $b_3$ is eliminated from
the first and third, and from the second and third equations, leaving
two equations for $b_1$ and $b_2$; eliminating $b_2$ from these,
$b_1$ is found, and thence by substitution, $b_2$ and $b_3$.

It  frequently  happens  that,  for  the  same  set  of 
values  of  the  independent  variates,  it  is  desired  to 
examine  the  regressions  for  more  than  one  set  of  values 
of  the  dependent  variate  ;  as,  for  example,  if  for  the 
same  set  of  rainfall  stations  we  had  data  for  several 
different  months  or  years.  In  such  cases  it  is  pre- 
ferable to  avoid  solving  the  simultaneous  equations 
afresh  on  each  occasion,  but  to  obtain  a  simpler 
formula  which  may  be  applied  to  each  new  case. 

This may be done by solving once and for all the three sets, each
consisting of three simultaneous equations:

$$\begin{aligned}
b_1 S(x_1^2) + b_2 S(x_1 x_2) + b_3 S(x_1 x_3) &= 1, \quad 0, \quad 0, \\
b_1 S(x_1 x_2) + b_2 S(x_2^2) + b_3 S(x_2 x_3) &= 0, \quad 1, \quad 0, \\
b_1 S(x_1 x_3) + b_2 S(x_2 x_3) + b_3 S(x_3^2) &= 0, \quad 0, \quad 1;
\end{aligned}$$

the three solutions of these three sets of equations may be written

$$\begin{aligned}
b_1 &= c_{11}, \quad c_{12}, \quad c_{13}, \\
b_2 &= c_{12}, \quad c_{22}, \quad c_{23}, \\
b_3 &= c_{13}, \quad c_{23}, \quad c_{33}.
\end{aligned}$$

Once  the  six  values  of  c  are  known,  then  the  partial 
regression  coefficients  may  be  obtained  in  any  particular 
case  merely  by  calculating  S(x1y),  S(x2y),  S(x3y)  and 
substituting  in  the  formulae, 

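$$\begin{aligned}
b_1 &= c_{11} S(x_1 y) + c_{12} S(x_2 y) + c_{13} S(x_3 y), \\
b_2 &= c_{12} S(x_1 y) + c_{22} S(x_2 y) + c_{23} S(x_3 y), \\
b_3 &= c_{13} S(x_1 y) + c_{23} S(x_2 y) + c_{33} S(x_3 y).
\end{aligned}$$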

The  method  of  partial  regression  is  of  very  wide 
application.  It  is  worth  noting  that  the  different 
independent  variates  may  be  related  in  any  way  ; 
for  example,  if  we  desired  to  express  the  rainfall  as 
a  linear  function  of  the  latitude  and  longitude,  and 
as  a  quadratic  function  of  the  altitude,  the  square 
of the altitude would be introduced as a fourth independent variate,
without in any way disturbing the process outlined above, save that
$S(x_3 x_4) = S(x_3^3)$ would be calculated directly from the
distribution of altitude.

In estimating the sampling errors of partial regression coefficients
we require to know how nearly our calculated value, $Y$, has
reproduced the observed values of $y$; as in previous cases, the sum
of the squares of $(y - Y)$ may be calculated by differences, for,
with three variates,

$$S(y - Y)^2 = S(y^2) - b_1 S(x_1 y) - b_2 S(x_2 y) - b_3 S(x_3 y).$$

If we had $n'$ sets of observations, and $p$ independent variates, we
should therefore find

$$s^2 = \frac{S(y - Y)^2}{n' - p - 1},$$

and to test if $b_1$ differed significantly from any hypothetical
value, $\beta$, we should calculate

$$t = \frac{b_1 - \beta}{s\sqrt{c_{11}}},$$

entering the table of $t$ with $n = n' - p - 1$.

In  the  practical  use  of  a  number  of  variates  it  is 
convenient  to  use  cards,  on  each  of  which  is  entered  the 
values  of  the  several  variates  which  may  be  required. 
By  sorting  these  cards  in  suitable  grouping  units  with 
respect  to  any  two  variates  the  corresponding  correla- 
tion table  may  be  constructed  with  little  risk  of  error, 
and  thence  the  necessary  sums  of  squares  and  products 
obtained. 

Ex.  24.  Dependence  of  rainfall  on  position  and 
altitude. — The  situations  of  57  rainfall  stations  in 
Hertfordshire have a mean longitude 12.4' W., a mean latitude
51° 48.5' N., and a mean altitude 302 feet. Taking as units 2 minutes
of longitude, one minute of latitude, and 20 feet of altitude, the
following values of the sums of squares and products of deviations
from the mean were obtained:

$$\begin{aligned}
S(x_1^2) &= 1934.1, & S(x_2 x_3) &= +119.6, \\
S(x_2^2) &= 2889.5, & S(x_3 x_1) &= +924.1, \\
S(x_3^2) &= 1750.8, & S(x_1 x_2) &= -772.2.
\end{aligned}$$

To  find  the  multipliers  suitable  for  any  particular 
set  of  weather  data  from  these  stations,  first  solve  the 
equations 

$$\begin{aligned}
1934.1\, c_{11} - 772.2\, c_{12} + 924.1\, c_{13} &= 1, \\
-772.2\, c_{11} + 2889.5\, c_{12} + 119.6\, c_{13} &= 0, \\
+924.1\, c_{11} + 119.6\, c_{12} + 1750.8\, c_{13} &= 0;
\end{aligned}$$

using the last equation to eliminate $c_{13}$ from the first
two, we have

$$\begin{aligned}
2532.3\, c_{11} - 1462.5\, c_{12} &= 1.7508, \\
-1462.5\, c_{11} + 5044.6\, c_{12} &= 0;
\end{aligned}$$

from these eliminate $c_{12}$, obtaining

$$10{,}635.5\, c_{11} = 8.8321;$$

whence

$$c_{11} = 0.00083043, \qquad c_{12} = 0.00024075, \qquad c_{13} = -0.00045476,$$

the  last  two  being  obtained  successively  by  substitution. 
Since the corresponding equations for $c_{12}$, $c_{22}$, $c_{23}$
differ only in changes in the right-hand member, we
can at once write down

$$-1462.5\, c_{12} + 5044.6\, c_{22} = 1.7508;$$

whence, substituting for $c_{12}$ the value already obtained,

$$c_{22} = 0.00041686, \qquad c_{23} = -0.00015554;$$

finally, to obtain $c_{33}$ we have only to substitute in the
equation

$$924.1\, c_{13} + 119.6\, c_{23} + 1750.8\, c_{33} = 1,$$

giving $c_{33} = 0.00082182$.

It  is  usually  worth  while,  to  facilitate  the  detection 
of  small  errors  by  checking,  to  retain  as  above  one 
more  decimal  place  than  the  data  warrant. 
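
The elimination carried out above amounts to inverting the matrix of
sums of squares and products. The following is a minimal sketch of that
step, assuming only the six sums quoted in Ex. 24; the function name
`invert` and the Gauss-Jordan routine are illustrative choices, not the
hand method of the text. The multipliers it yields should agree with
those obtained above to within rounding in the last figure.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix,
    # i.e. solve for the right-hand members (1,0,0), (0,1,0), (0,0,1) at once.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Sums of squares and products from Ex. 24, in the order
# longitude (x1), latitude (x2), altitude (x3).
S = [[1934.1, -772.2, 924.1],
     [-772.2, 2889.5, 119.6],
     [924.1, 119.6, 1750.8]]

c = invert(S)  # c[i][j] plays the part of c_ij in the text
for row in c:
    print(["%.8f" % v for v in row])
```

Because the matrix is symmetric, only six of the nine entries are
distinct, which is why the text speaks of six values of $c$.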

The  partial  regression  of  any  particular  weather 
data  on  these  three  variates  can  now  be  found  with 
little labour. In January 1922 the mean rainfall
recorded at these stations was 3.87 inches, and the
sums of products of deviations with those of the three
independent variates were (taking 0.1 inch as the unit
for rain)

$$S(x_1 y) = +1137.4, \qquad S(x_2 y) = -592.9, \qquad S(x_3 y) = +891.8;$$

multiplying these first by $c_{11}$, $c_{12}$, $c_{13}$ and adding, we
have for the partial regression on longitude

$$b_1 = 0.39624;$$

similarly using the multipliers $c_{12}$, $c_{22}$, $c_{23}$ we obtain for
the partial regression on latitude

$$b_2 = -0.11204;$$

and finally, by using $c_{13}$, $c_{23}$, $c_{33}$,

$$b_3 = 0.30787$$

gives the partial regression on altitude.

Remembering now the units employed, it appears
that in the month in question rainfall increased by
0.0198 of an inch for each minute of longitude westwards,
it decreased by 0.0112 of an inch for each minute
of latitude northwards, and increased by 0.00154 of an
inch for each foot of altitude.
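
The multiplication by the $c$ multipliers and the conversion back to
ordinary units can be checked with a short sketch. It assumes the
rounded multipliers and sums of products quoted in Ex. 24; the variable
names are illustrative only.

```python
# Multipliers c_ij from Ex. 24 (symmetric, in working units).
c = [[0.00083043, 0.00024075, -0.00045476],
     [0.00024075, 0.00041686, -0.00015554],
     [-0.00045476, -0.00015554, 0.00082182]]

# Sums of products with January 1922 rainfall (unit of rain: 0.1 inch).
s_xy = [1137.4, -592.9, 891.8]

# b_i = c_i1 S(x1 y) + c_i2 S(x2 y) + c_i3 S(x3 y)
b = [sum(cij * s for cij, s in zip(row, s_xy)) for row in c]

# Working units: 2 minutes of longitude, 1 minute of latitude, 20 feet.
rain_unit = 0.1
per_minute_west = b[0] * rain_unit / 2
per_minute_north = b[1] * rain_unit / 1
per_foot = b[2] * rain_unit / 20

print([round(v, 5) for v in b])
print(round(per_minute_west, 4), round(per_minute_north, 4), round(per_foot, 5))
```

Only `s_xy` changes from one month to another; the multipliers are
fixed by the positions of the stations.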

Let  us  calculate  to  what  extent  the  regression  on 
altitude  is  affected  by  sampling  errors.  For  the  57 
recorded  deviations  of  the  rainfall  from  its  mean  value, 
in the units previously used

$$S(y^2) = 1786.6;$$

whence, knowing the values of $b_1$, $b_2$, and $b_3$, we
obtain by differences

$$S(y - Y)^2 = 994.9.$$

To find $s^2$, we must divide this by the number of
degrees of freedom remaining after fitting a formula
involving three variates, that is, by 53, so that

$$s^2 = 18.772;$$

multiplying this by $c_{33}$ and taking the square root,

$$s\sqrt{c_{33}} = 0.12421.$$

Since $n$ is as high as 53 we shall not be far wrong in
taking the regression of rainfall on altitude to be in
working units 0.308, with a standard error 0.124; or
in inches of rain per 100 feet as 0.154, with a standard
error 0.062.
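
The standard error just found can be retraced as follows. This is a
sketch using only the figures quoted in Ex. 24; the final line, which
forms $t$ against the hypothetical value $\beta = 0$, is an
illustrative assumption and is not a calculation made in the text.

```python
import math

n_stations = 57                      # n', the number of sets of observations
p = 3                                # independent variates fitted
s_yy = 1786.6                        # S(y^2), rain in units of 0.1 inch
s_xy = [1137.4, -592.9, 891.8]       # S(x1 y), S(x2 y), S(x3 y)
b = [0.39624, -0.11204, 0.30787]     # partial regression coefficients
c33 = 0.00082182                     # multiplier belonging to altitude

# S(y - Y)^2 obtained by differences.
residual = s_yy - sum(bi * si for bi, si in zip(b, s_xy))

dof = n_stations - p - 1             # degrees of freedom
s2 = residual / dof
se_b3 = math.sqrt(s2 * c33)          # standard error of b3, in working units

# Illustrative only: t for the hypothetical value beta = 0.
t_b3 = (b[2] - 0.0) / se_b3

print(round(residual, 1), dof, round(s2, 3), round(se_b3, 5))
```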

THE CORRELATION COEFFICIENT

30.  No  quantity  is  more  characteristic  of  modern 
statistical  work  than  the  correlation  coefficient,  and  no 
method  has  been  applied  successfully  to  such  various 
data  as  the  method  of  correlation.  Observational 
data  in  particular,  in  cases  where  we  can  observe  the 
occurrence  of  various  possible  contributory  causes  of 
a  phenomenon,  but  cannot  control  them,  has  been 
given  by  its  means  an  altogether  new  importance. 
In  experimental  work  proper  its  position  is  much 
less  central  ;  it  will  be  found  useful  in  the  exploratory 
stages  of  an  enquiry,  as  when  two  factors  which  had 
been  thought  independent  appear  to  be  associated 
in  their  occurrence  ;  but  it  is  seldom,  with  controlled 
experimental  conditions,  that  it  is  desired  to  express 
our  conclusion  in  the  form  of  a  correlation  coefficient. 

One of the earliest and most striking successes of
the method of correlation was in the biometrical study
of inheritance. At a time when nothing was known
of the mechanism of inheritance, or of the structure of
the germinal material, it was possible by this method
to demonstrate the existence of inheritance, and to
"measure its intensity"; and this in an organism in
which  experimental  breeding  could  not  be  practised, 
namely,  Man.  By  comparison  of  the  results  obtained 
from  the  physical  measurements  in  man  with  those 
obtained  from  other  organisms,  it  was  established  that 
man's  nature  is  not  less  governed  by  heredity  than 
that  of  the  rest  of  the  animate  world.  The  scope  of 
the  analogy  was  further  widened  by  demonstrating 
that  correlation  coefficients  of  the  same  magnitude  were 
obtained  for  the  mental  and  moral  qualities  in  man 
as  for  the  physical  measurements. 

These  results  are  still  of  fundamental  importance, 
for  not  only  is  inheritance  in  man  still  incapable  of 
experimental  study,  and  existing  methods  of  mental 
testing  are  still  unable  to  analyse  the  mental  disposi- 
tion, but  even  with  organisms  suitable  for  experiment 
and  measurement,  it  is  only  in  the  most  favourable 
cases  that  the  several  factors  causing  fluctuating 
variability  can  be  resolved,  and  their  effects  studied, 
by  Mendelian  methods.  Such  fluctuating  variability, 
with  an  approximately  normal  distribution,  is  character- 
istic of  the  majority  of  the  useful  qualities  of  domestic 
plants  and  animals  ;  and  although  there  is  strong 
reason  to  think  that  inheritance  in  such  cases  is 
ultimately  Mendelian,  the  biometrical  method  of  study 
is  at  present  alone  capable  of  holding  out  hopes  of 
immediate  progress. 

We give in Table 31 an example of a correlation
table. It consists of a record in compact form of the
stature of 1376 fathers and daughters. (Pearson and
Lee's data.) The measurements are grouped in
inches, and those whose measurement was recorded as
an integral number of inches have been split; thus a
father recorded as of 67 inches would appear as 1/2 under
66.5 and 1/2 under 67.5. Similarly with the daughters;
in consequence, when both measurements are whole
numbers the case appears in four quarters. This
gives the table a confusing appearance, since the
majority of entries are fractional, although they represent
frequencies. It is preferable, if bias in measurement
can be avoided, to group the observations in
such a way that each possible observation lies wholly
within one group.

TABLE 31

Height of Fathers in Inches (marginal totals)

Height    58.5  59.5  60.5  61.5  62.5  63.5  64.5  65.5  66.5
Total        2   4.5   7.5  14.5    45  51.5  92.5   155   178

Height    67.5  68.5  69.5  70.5  71.5  72.5  73.5  74.5  75.5  Total
Total      175 199.5   166   135  82.5  36.5    20   6.5   4.5   1376

Height of Daughters in Inches (marginal totals)

Height    52.5  53.5  54.5  55.5  56.5  57.5  58.5  59.5  60.5  61.5  62.5
Total      0.5   0.5     0     1   4.5  14.5  15.5  48.5    99 141.5 190.5

Height    63.5  64.5  65.5  66.5  67.5  68.5  69.5  70.5  71.5  72.5  Total
Total      212 198.5 159.5 142.5  77.5    36  19.5   9.5     4     1   1376
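
The rule for splitting whole-inch records between adjacent groups can
be sketched as follows. The function names and the two sample records
are hypothetical; groups are taken to be one inch wide and centred on
the half-inch, as in Table 31.

```python
import math
from collections import defaultdict


def class_weights(value):
    """Return {group centre: weight} for one measurement in inches."""
    if float(value).is_integer():
        # A whole-inch record lies on a group boundary: half goes each way.
        return {value - 0.5: 0.5, value + 0.5: 0.5}
    return {math.floor(value) + 0.5: 1.0}


def correlation_table(pairs):
    """Accumulate (father, daughter) records into fractional cell frequencies."""
    table = defaultdict(float)
    for father, daughter in pairs:
        for f_centre, f_weight in class_weights(father).items():
            for d_centre, d_weight in class_weights(daughter).items():
                table[(f_centre, d_centre)] += f_weight * d_weight
    return dict(table)


# Hypothetical records: both whole numbers, then only the daughter whole.
quarters = correlation_table([(67, 63)])
halves = correlation_table([(67.2, 63)])
print(quarters)
print(halves)
```

Each record contributes a total weight of one however it is split, so
the marginal totals still add up to the number of cases.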

The  most  obvious  feature  of  the  table  is  that  cases 
do  not  occur  in  which  the  father  is  very  tall  and  the 
daughter  very  short,  and  vice  versa  ;  the  upper  right- 
hand  and  lower  left-hand  corners  of  the  table  are  blank, 
so  that  we  may  conclude  that  such  occurrences  are  too 
rare  to  occur  in  a  sample  of  about  1400  cases.  The 
observations  recorded  lie  in  a  roughly  elliptical  figure 
lying  diagonally  across  the  table.  If  we  mark  out  the 
region  in  which  the  frequencies  exceed  10  it  appears 
that  this  region,  apart  from  natural  irregularities,  is 
similar,  and  similarly  situated.  The  frequency  of 
occurrence  increases  from  all  sides  to  the  central  region 
of  the  table,  where  a  few  frequencies  over  30  may 
be  seen.  The  lines  of  equal  frequency  are  roughly 
similar  and  similarly  situated  ellipses.  In  the  outer 
zone  observations  occur  only  occasionally,  and  there- 
fore irregularly  ;  beyond  this  we  could  only  explore 
by  taking  a  much  larger  sample. 

The table has been divided into four quadrants by
marking out central values of the two variates; these
values, 67.5 inches for the fathers and 63.5 inches for
the daughters, are near the means. When the table
is  so  divided  it  is  obvious  that  the  lower  right-hand 
and  upper  left-hand  quadrants  are  distinctly  more 
populous  than  the  other  two  ;  not  only  are  more 
squares  occupied,  but  the  frequencies  are  higher.  It 
is  apparent  that  tall  men  have  tall  daughters  more 
frequently  than  the  short  men,  and  vice  versa.  The 
method  of  correlation  aims  at  measuring  the  degree 
to  which  this  association  exists. 

The  marginal  totals  show  the  frequency  distribu- 
tions of  the  fathers  and  the  daughters  respectively. 
These  are  both  approximately  normal  distributions, 
as  is  frequently  the  case  with  biometrical  data  collected 
without  selection.  This  marks  a  frequent  difference 
between  biometrical  and  experimental  data.  An 
experimenter  would  perhaps  have  bred  from  two  con- 
trasted groups  of  fathers  of,  for  example,  63  and 
72  inches  in  height  ;  all  his  fathers  would  then  belong 
to  these  two  classes,  and  the  correlation  coefficient,  if 
used,  would  be  almost  meaningless.  Such  an  experi- 
ment would  serve  to  ascertain  the  regression  of 
daughter's  height  on  father's  height,  and  so  to  deter- 
mine the  effect  on  the  daughters  of  selection  applied 
to  the  fathers,  but  it  would  not  give  us  the  correlation 
coefficient,  which  is  a  descriptive  observational  feature 
of  the  population  as  it  is,  and  may  be  wholly  vitiated 
by  selection. 

Just as normal variation with one variate may
be specified by a frequency formula in which the
logarithm of the frequency is a quadratic function
of the variate, so with two variates the frequency
may be expressible in terms of a quadratic function
of the values of the two variates. We then have a
normal correlation surface, for which the frequency
may conveniently be written in the form

$$df = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,
e^{-\frac{1}{2(1-\rho^2)}\left\{\frac{x^2}{\sigma_1^2} - \frac{2\rho xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right\}}\,dx\,dy.$$

In this expression $x$ and $y$ are the deviations of
the two variates from their means, $\sigma_1$ and $\sigma_2$ are the
two standard deviations, and $\rho$ is the correlation
between $x$ and $y$. The correlation in the above
expression may be positive or negative, but cannot
exceed unity in magnitude; it is a pure number
without physical dimensions. If $\rho = 0$, the expression
for the frequency degenerates into the product of the
two factors

$$\frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma_1^2}}\,dx \cdot
\frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-\frac{y^2}{2\sigma_2^2}}\,dy,$$

showing that the limit of the normal correlation surface,
when the correlation vanishes, is merely that of
two normally distributed variates varying in complete
independence. At the other extreme, when $\rho$ is $+1$
or $-1$, the variation of the two variates is in strict
proportion, so that the value of either may be calculated
accurately from that of the other. In other words, we
cease strictly to have two variates, but merely two
measures of the same variable quantity.
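
The behaviour of the normal correlation surface at $\rho = 0$ can be
checked numerically. The sketch below codes the frequency formula
directly; the function names and the numerical arguments are
illustrative assumptions.

```python
import math


def normal_density(x, sigma):
    """Frequency of a single normal variate measured from its mean."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def bivariate_density(x, y, sigma1, sigma2, rho):
    """Normal correlation surface; requires -1 < rho < 1."""
    q = (x / sigma1) ** 2 - 2 * rho * x * y / (sigma1 * sigma2) + (y / sigma2) ** 2
    norm = 2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho ** 2)
    return math.exp(-q / (2 * (1 - rho ** 2))) / norm


# With rho = 0 the surface is the product of the two separate frequencies.
joint = bivariate_density(1.0, -0.5, 2.0, 3.0, 0.0)
product = normal_density(1.0, 2.0) * normal_density(-0.5, 3.0)
print(joint, product)
```

The formula cannot be evaluated at $\rho = \pm 1$, where the factor
$\sqrt{1-\rho^2}$ vanishes; this corresponds to the remark that the two
variates then reduce to two measures of the same quantity.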

If  we  pick  out  the  cases  in  which  one  variate  has 
an assigned value, we have what is termed an array;
the columns and rows of the table may, except as
regards variation within the group limits, be regarded
as arrays. With normal correlation the variation
within an array may be obtained from the general
formula, by giving $x$ a constant value, (say) $a$, and
dividing by the total frequency with which this value
occurs; then we have

$$df = \frac{1}{\sigma_2\sqrt{2\pi}\sqrt{1-\rho^2}}\,
e^{-\frac{1}{2(1-\rho^2)}\left(\frac{y}{\sigma_2} - \frac{\rho a}{\sigma_1}\right)^2}\,dy,$$

showing (i.) that the variation of $y$ within the array is
normal; (ii.) that the mean value of $y$ for that array is
$\rho a \sigma_2/\sigma_1$, so that the regression of $y$ on $x$ is linear, with
regression coefficient

$$\rho\,\frac{\sigma_2}{\sigma_1};$$

and (iii.) that the variance of $y$ within the array is
$\sigma_2^2(1-\rho^2)$, and is the same within each array. We
may express this by saying that of the total variance
of $y$ the fraction $(1-\rho^2)$ is independent of $x$, while
the remaining fraction, $\rho^2$, is determined by, or calculable
from, the value of $x$.

These relations are reciprocal; the regression of $x$
on $y$ is linear, with regression coefficient $\rho\sigma_1/\sigma_2$; the
correlation $\rho$ is thus the geometric mean of the two
regressions. The two regression lines representing
the mean value of $x$ for given $y$, and the mean value of
$y$ for given $x$, cannot coincide unless $\rho = \pm 1$. The
variation of $x$ within an array in which $y$ is fixed is
normal with variance equal to $\sigma_1^2(1-\rho^2)$, so that we
may say that of the variance of $x$ the fraction $(1-\rho^2)$
is independent of $y$, and the remaining fraction, $\rho^2$, is
determined by, or calculable from, the value of $y$.
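
The three properties of an array may be verified numerically by
dividing the frequency surface by the frequency of the fixed variate.
In this sketch the standard deviations, the correlation, and the
assigned value $a$ are illustrative figures only, and the function
names are hypothetical.

```python
import math


def normal_pdf(v, mean, variance):
    return math.exp(-(v - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def joint_pdf(x, y, s1, s2, rho):
    q = (x / s1) ** 2 - 2 * rho * x * y / (s1 * s2) + (y / s2) ** 2
    return math.exp(-q / (2 * (1 - rho ** 2))) / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))


def array_pdf(y, a, s1, s2, rho):
    """Frequency of y within the array x = a: the surface divided by the frequency of x."""
    return joint_pdf(a, y, s1, s2, rho) / normal_pdf(a, 0.0, s1 ** 2)


s1, s2, rho, a = 2.7, 2.6, 0.5, 2.0      # illustrative values only

mean_y = rho * a * s2 / s1               # (ii.) mean of the array
var_y = s2 ** 2 * (1 - rho ** 2)         # (iii.) variance within the array

# The correlation is the geometric mean of the two regression coefficients.
b_yx = rho * s2 / s1
b_xy = rho * s1 / s2
geometric_mean = math.sqrt(b_yx * b_xy)

for y in (-3.0, 0.0, 1.5):
    print(array_pdf(y, a, s1, s2, rho), normal_pdf(y, mean_y, var_y))
```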

Such  are  the  formal  mathematical  consequences 
of  normal  correlation.  Much  biometric  data  certainly 
shows  a  general  agreement  with  the  features  to  be 
expected  on  this  assumption  ;  though  I  am  not 
aware  that  the  question  has  been  subjected  to  any 
sufficiently  critical  enquiry.  Approximate  agreement 
is  perhaps  all  that  is  needed  to  justify  the  use  of  the 
correlation  as  a  quantity  descriptive  of  the  popula- 
tion ;  its  efficacy  in  this  respect  is  undoubted,  and  it 
is  not  improbable  that  in  some  cases  it  affords  a  com- 
plete description  of  the  simultaneous  variation  of  the 
variates. 

31.  The  Statistical  Estimation  of  the  Correlation 

Just  as  the  mean  and  the  standard  deviation  of  a 
normal  population  in  one  variate  may  be  most  satis- 
factorily estimated  from  the  first  two  moments  of  the 
observed  distribution,  so  the  only  satisfactory  estimate 
of  the  correlation,  when  the  variates  are  normally 
correlated, is found from the "product moment". If
$x$ and $y$ represent the deviations of the two variates
from their means, we calculate the three statistics
$s_1$, $s_2$, $r$ by the three equations

$$n s_1^2 = S(x^2), \qquad n s_2^2 = S(y^2), \qquad n r s_1 s_2 = S(xy);$$

then $s_1$ and $s_2$ are estimates of the standard deviations
$\sigma_1$ and $\sigma_2$, and $r$ is an estimate of the correlation $\rho$.
Such an estimate is called the correlation coefficient,
or the product moment correlation, the latter term
referring to the summation of the product terms, $xy$,
in the last equation.

The  above  method  of  calculation  might  have  been 
derived  from  the  consideration  that  the  correlation 
of  the  population  is  the  geometric  mean  of  the  two 
regression  coefficients  ;  for  our  estimates  of  these  two 
regressions would be

$$\frac{S(xy)}{S(x^2)} \quad\text{and}\quad \frac{S(xy)}{S(y^2)},$$

so that it is in accordance with these estimates to take
as our estimate of $\rho$

$$r = \frac{S(xy)}{\sqrt{S(x^2)\cdot S(y^2)}},$$

which  is  in  fact  the  product  moment  correlation. 
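
For ungrouped data the product moment correlation may be computed
directly from the three sums. A minimal sketch follows, in which the
function name and the five pairs of heights are hypothetical and serve
only to show that $r^2$ equals the product of the two estimated
regressions.

```python
import math


def product_moment(xs, ys):
    """Return (r, regression of y on x, regression of x on y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)                        # S(x^2)
    s_yy = sum((y - mean_y) ** 2 for y in ys)                        # S(y^2)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # S(xy)
    return s_xy / math.sqrt(s_xx * s_yy), s_xy / s_xx, s_xy / s_yy


# Hypothetical heights in inches, for illustration only.
fathers = [65, 67, 68, 70, 72]
daughters = [62, 63, 65, 64, 67]

r, b_yx, b_xy = product_moment(fathers, daughters)
print(r, b_yx, b_xy)
```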

Ex.  25.  Parental  correlation  in  stature. — The 
numerical  work  required  to  calculate  the  correlation 
coefficient  is  shown  below  in  Table  32. 

The  first  eight  columns  require  no  explanation, 
since  they  merely  repeat  the  usual  process  of  finding 
the  mean  and  standard  deviation  of  the  two  marginal 
distributions. It is not necessary actually to find the
mean, by dividing the total of the third column,
480.5, by 1376, since we may work all through with
the undivided totals. The correction for the fact that
our working mean is not the true mean is performed
by subtracting $(480.5)^2 \div 1376$ in the 4th column;
a similar correction appears at the foot of the 8th
column, and at the foot of the last column. The
correction for the sum of products is performed by
subtracting $480.5 \times 260.5 \div 1376$. This correction of
the product term may be positive or negative; if the
total deviations of the two variates are of opposite sign,
the correction must be added. The sums of squares,
with and without Sheppard's correction ($1376 \div 12$),
are shown separately; there is no corresponding
correction to be made to the product term.

The  9th  column  shows  the  total  deviations  of  the 
daughter's  height  for  each  of  the  18  columns  in  which 
Table  31   is  divided.     When  the  numbers  are  small, 
these  may  usually  be  written  down  by  inspection  of 
the  table.      In  the  present  case,  where  the  numbers 
are  large,  and  the  entries  are  complicated  by  quarter- 
ing,  more  care  is  required.     The  total  of  column  9 
checks  with  that  of  the  3rd  column.     In  order  that  it 
shall do so, the central entry, +15.5, which does not
contribute  to  the  products,  has  to  be  included.     Each 
entry  in  the  9th  column  is  multiplied  by  the  paternal 
deviation  to  give  the   10th  column.      In  the  present 
case  all  the  entries  in  column   10  are  positive  ;    fre- 
quently both  positive  and  negative  entries  occur,  and 
it  is  then  convenient  to  form  a  separate  column  for 
each.     A  useful  check  is  afforded  by  repeating  the 
work    of   the    last    two    columns,    interchanging    the 
variates ;  we  should  then  find  the  total  deviation  of 
the  fathers  for  each  array  of  daughters,  and  multiply 
by the daughter's deviation. The uncorrected totals,
5136.25, should then agree. This check is especially
useful  with  small  tables,  in  which  the  work  of  the  last 
two  columns,  carried  out  rapidly,  is  liable  to  error. 

The value of the correlation coefficient, using
Sheppard's correction, is found by dividing 5045.28
by the geometric mean of 9209.0 and 10,392.5; its
value is +0.5157. If Sheppard's correction had not
been used, we should have obtained +0.5097. The
difference  is  in  this  case  not  large  compared  to  the 
errors  of  random  sampling,  and  the  full  effects  on  the 
distribution  in  random  samples  of  using  Sheppard's 
correction  have  never  been  fully  examined,  but  there 
can  be  little  doubt  that  Sheppard's  correction  should 
be  used,  and  that  its  use  gives  generally  an  improved 
estimate  of  the  correlation.  On  the  other  hand,  the 
distribution  in  random  samples  of  the  uncorrected 
value  is  simpler  and  better  understood,  so  that  the  un- 
corrected value  should  be  used  in  tests  of  significance, 
in  which  the  effect  of  correction  need  not,  of  course,  be 
overlooked.  For  simplicity  coarse  grouping  should 
be  avoided  where  such  tests  are  intended.  The  fact 
that  with  small  samples  the  correlation  obtained  by 
the  use  of  Sheppard's  correction  may  exceed  unity, 
illustrates  the  disturbance  introduced  into  the  random 
sampling  distribution. 
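
The arithmetic of Ex. 25 can be retraced from the marginal totals of
Table 31 and the uncorrected sum of products, 5136.25, quoted above.
The sketch assumes working means of 67.5 inches for the fathers and
63.5 inches for the daughters and a grouping unit of one inch; the
variable names are illustrative.

```python
import math

# Marginal totals of Table 31, with deviations from the working means.
father_dev = range(-9, 9)            # 58.5 ... 75.5 measured from 67.5
father_freq = [2, 4.5, 7.5, 14.5, 45, 51.5, 92.5, 155, 178,
               175, 199.5, 166, 135, 82.5, 36.5, 20, 6.5, 4.5]
daughter_dev = range(-11, 10)        # 52.5 ... 72.5 measured from 63.5
daughter_freq = [0.5, 0.5, 0, 1, 4.5, 14.5, 15.5, 48.5, 99, 141.5, 190.5,
                 212, 198.5, 159.5, 142.5, 77.5, 36, 19.5, 9.5, 4, 1]

n = sum(father_freq)
sum_products = 5136.25               # uncorrected total quoted in the text


def weighted_totals(devs, freqs):
    """Total deviation and total squared deviation about the working mean."""
    total = sum(d * f for d, f in zip(devs, freqs))
    squares = sum(d * d * f for d, f in zip(devs, freqs))
    return total, squares


fx, fxx = weighted_totals(father_dev, father_freq)
dy, dyy = weighted_totals(daughter_dev, daughter_freq)

# Corrections for the working means not being the true means.
s_xx = fxx - fx ** 2 / n
s_yy = dyy - dy ** 2 / n
s_xy = sum_products - fx * dy / n

sheppard = n / 12                    # applied to the sums of squares only
r_raw = s_xy / math.sqrt(s_xx * s_yy)
r_sheppard = s_xy / math.sqrt((s_xx - sheppard) * (s_yy - sheppard))

print(round(s_xy, 2), round(r_sheppard, 4), round(r_raw, 4))
```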


32.  Partial  Correlations 

A  great  extension  of  the  utility  of  the  idea  of 
correlation  lies  in  its  application  to  groups  of  more  than 
two  variates.  In  such  cases,  where  the  correlation 
between  each  pair  of  three  variates  is  known,  it  is 
possible  to  eliminate  any  one  of  them,  and  so  find  what 
the  correlation  of  the  other  two  would  be  in  a  popula- 
tion selected  so  that  the  third  variate  was  constant. 

Ex. 26. Elimination of age in organic correlations
with growing children. For example, it was found
(Mumford and Young's data) in a group of boys of
different ages, that the correlation of standing height
with chest girth was +0.836. One might expect that
part  of  this  association  was  due  to  general  growth  with 
age.  It  would  be  more  desirable  for  many  purposes 
to  know  the  correlation  between  the  variates  for  boys 
of  a  given  age  ;  but  in  fact  only  a  few  of  the  boys  will 
be  exactly  of  the  same  age,  and  even  if  we  make  age 
groups  as  broad  as  a  year,  we  shall  have  in  each  group 
much  fewer  than  the  total  number  measured.  In 
order  to  utilise  the  whole  material,  we  only  need  to 
know  the  correlations  of  standing  height  with  age,  and 
of chest girth with age. These are given as 0.714 and
0.708.

The fundamental formula in calculating partial
correlation coefficients may be written

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}.$$

Here the three variates are numbered 1, 2, and 3, and
we wish to find the correlation between 1 and 2, when
3 is eliminated; this is called the "partial" correlation
between 1 and 2, and is designated by $r_{12.3}$, to show
that variate 3 has been eliminated. The symbols $r_{12}$,
$r_{13}$, $r_{23}$ indicate the correlations found directly between
each pair of variates; these correlations being distinguished
as "total" correlations.
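
Applied to the three total correlations quoted in Ex. 26 (height with
chest girth, height with age, chest girth with age), the formula may be
sketched as follows; the function name is an illustrative choice.

```python
import math


def partial_correlation(r12, r13, r23):
    """Correlation of variates 1 and 2 with variate 3 eliminated."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Variate 1: standing height; 2: chest girth; 3: age.
r12_3 = partial_correlation(0.836, 0.714, 0.708)
print(round(r12_3, 3))
```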

Inserting the numerical values in the above
formula we find $r_{12.3} = 0.668$, showing that when age
is eliminated the correlation, though still considerable,
has been markedly reduced. The mean value given
by the above-mentioned authors for the correlations
found by grouping the boys by years, is 0.653, not a
greatly different value. In a similar manner, two or
more variates may be eliminated in succession; thus
with four variates, we may first eliminate variate 4,
by thrice applying the above formula to find $r_{12.4}$,
$r_{13.4}$, and $r_{23.4}$. Then applying the same formula
again, to these three new values, we have

$$r_{12.34} = \frac{r_{12.4} - r_{13.4}\,r_{23.4}}{\sqrt{(1 - r_{13.4}^2)(1 - r_{23.4}^2)}}.$$

The labour increases rapidly with the number of
variates to be eliminated. To eliminate $s$ variates,
the number of operations involved, each one application
of the above formula, is $\tfrac{1}{6}s(s+1)(s+2)$; for
values of $s$ from 1 to 6 this gives 1, 4, 10, 20, 35, 56
operations. Much of this labour may be saved by
using tables of $\sqrt{1 - r^2}$ such as that published by
J. R. Miner.
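
Successive elimination, and the count of operations it entails, can be
sketched as below. The four-variate table of total correlations is
hypothetical, and the helper names `eliminate_last` and `operations`
are illustrative; each call of `eliminate_last` removes the last
variate by applying the fundamental formula to every remaining pair.

```python
import math
from itertools import combinations


def eliminate_last(corr, counter):
    """Partial correlations among all variates but the last, the last being eliminated."""
    k = len(corr) - 1
    out = [[1.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        num = corr[i][j] - corr[i][k] * corr[j][k]
        den = math.sqrt((1 - corr[i][k] ** 2) * (1 - corr[j][k] ** 2))
        out[i][j] = out[j][i] = num / den
        counter[0] += 1          # one application of the formula
    return out


def operations(s):
    """Number of applications needed to eliminate s variates."""
    return s * (s + 1) * (s + 2) // 6


# Hypothetical total correlations among four variates.
corr = [[1.0, 0.5, 0.4, 0.3],
        [0.5, 1.0, 0.6, 0.2],
        [0.4, 0.6, 1.0, 0.1],
        [0.3, 0.2, 0.1, 1.0]]

counter = [0]
after_4 = eliminate_last(corr, counter)      # r12.4, r13.4, r23.4
after_34 = eliminate_last(after_4, counter)  # r12.34
r12_34 = after_34[0][1]

print(r12_34, counter[0], [operations(s) for s in range(1, 7)])
```

Eliminating two variates from four thus takes three applications and
then one, in agreement with the count of 4 given for $s = 2$.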

The  meaning  of  the  correlation  coefficient  should 
be  borne  clearly  in  mind.  The  original  aim  to 
measure  the  "  strength  of  heredity  "  by  this  method 
was  based  clearly  on  the  supposition  that  the  whole 
class  of  factors  which  tend  to  make  relatives  alike,  in 
contrast  to  the  unlikeness  of  unrelated  persons,  may 
be  grouped  together  as  heredity.  That  this  is  so  for 
all  practical  purposes  is,  I  believe,  admitted,  but  the 
correlation  does  not  tell  us  that  this  is  so  ;  it  merely 
tells us the degree of resemblance in the actual population
studied, between father and daughter. It tells
us to what extent the height of the father is relevant
information  respecting  the  height  of  the  daughter,  or, 
otherwise  interpreted,  it  tells  us  the  relative  importance 
of  the  factors  which  act  alike  upon  the  heights  of  father 
and  daughter,  compared  to  the  totality  of  factors  at 
work.  If  we  know  that  B  is  caused  by  A,  together 
with  other  factors  independent  of  A,  and  that  B  had 
no  influence  on  A,  then  the  correlation  between  A 
and  B  does  tell  us  how  important,  in  relation  to  the 
other  causes  at  work,  is  the  influence  of  A.  If  we  have 
not  such  knowledge,  the  correlation  does  not  tell  us 
whether  A  causes  B,  or  B  causes  A,  or  whether  both 
influences  are  at  work,  together  with  the  effects  of 
common  causes. 

This  is  true  equally  of  partial  correlations.  If  we 
know  that  a  phenomenon  A  is  not  itself  influential  in 
determining  certain  other  phenomena  B,  C,  D,  .  .  ., 
but  on  the  contrary  is  probably  directly  influenced  by 
them,  then  the  calculation  of  the  partial  correlations 
A  with  B,  C,  D,  .  .  .,  in  each  case  eliminating  the 
remaining  values,  will  form  a  most  valuable  analysis 
of  the  causation  of  A.  If  on  the  contrary  we  choose 
a  group  of  social  phenomena  with  no  antecedent 
knowledge  of  the  causation  or  absence  of  causation 
among  them,  then  the  calculation  of  correlation 
coefficients,  total  or  partial,  will  not  advance  us  a 
step  towards  evaluating  the  importance  of  the  causes 
at  work. 

The  correlation  between  A  and  B  measures,  on  a 
conventional  scale,  the  importance  of  the  factors  which 
(on a balance of like and unlike action) act alike in
both A and B, as against the remaining factors which
affect  A  and  B  independently.  If  we  eliminate  a  third 
variate  C,  we  are  removing  from  the  comparison  all 
those  factors  which  become  inoperative  when  C  is 
fixed.  If  these  are  only  those  which  affect  A  and  B 
independently,  then  the  correlation  between  A  and  B, 
whether  positive  or  negative,  will  be  numerically 
increased.  We  shall  have  eliminated  irrelevant  dis- 
turbing factors,  and  obtained,  as  it  were,  a  better 
controlled  experiment.  We  may  also  require  to 
eliminate  C  if  these  factors  act  alike,  or  oppositely  on 
the  two  variates  correlated ;  in  such  a  case  the  varia- 
bility of  C  actually  masks  the  effect  we  wish  to  in- 
vestigate. Thirdly,  C  may  be  one  of  the  chain  of 
events  by  the  mediation  of  which  A  affects  B,  or  vice 
versa.  The  extent  to  which  C  is  the  channel  through 
which  the  influence  passes  may  be  estimated  by 
eliminating  C  ;  as  one  may  demonstrate  the  small 
effect  of  latent  factors  in  human  heredity  by  finding 
the  correlation  of  grandparent  and  grandchild,  elimi- 
nating the  intermediate  parent.  In  no  case,  however, 
can  we  judge  whether  or  not  it  is  profitable  to  eliminate 
a  certain  variate  unless  we  know,  or  are  willing  to 
assume,  a  qualitative  scheme  of  causation.  For  the 
purely  descriptive  purpose  of  specifying  a  population 
in  respect  of  a  number  of  variates,  either  partial  or 
total  correlations  are  effective,  and  correlations  of  either 
type  may  be  of  interest. 

As  an  illustration  we  may  consider  in  what  sense  the 
coefficient of correlation does measure the "strength
of heredity," assuming that heredity only is concerned
in causing the resemblance between relatives; that
is,  that  any  environmental  effects  are  distributed  at 
haphazard.  In  the  first  place,  we  may  note  that  if 
such  environmental  effects  are  increased  in  magni- 
tude, the  correlations  would  be  reduced  ;  thus  the 
same  population,  genetically  speaking,  would  show 
higher  correlations  if  reared  under  relatively  uniform 
nutritional  conditions,  than  they  would  if  the  nutri- 
tional conditions  had  been  very  diverse  ;  although  the 
genetical  processes  in  the  two  cases  were  identical. 
Secondly,  if  environmental  effects  were  at  all  influential 
(as  in  the  population  studied  seems  not  to  be  indeed  the 
case),  we  should  obtain  higher  correlations  from  a 
mixed  population  of  genetically  very  diverse  strains, 
than  we  should  from  a  more  uniform  population. 
Thirdly,  although  the  influence  of  father  on  daughter 
is  in  a  certain  sense  direct,  in  that  the  father  contri- 
butes to  the  germinal  composition  of  his  daughter,  we 
must  not  assume  that  this  fact  is  necessarily  the  cause 
of  the  whole  of  the  correlation  ;  for  it  has  been  shown 
that  husband  and  wife  also  show  considerable  resem- 
blance in  stature,  and  consequently  taller  fathers  tend 
to  have  taller  daughters  partly  because  they  choose,  or 
are  chosen  by,  taller  wives.  For  this  reason,  for 
example,  we  should  expect  to  find  a  noticeable  positive 
correlation  between  step-fathers  and  step-daughters  ; 
also  that,  when  the  stature  of  the  wife  is  eliminated, 
the  partial  correlation  between  father  and  daughter 
will  be  found  to  be  lower  than  the  total  correlation. 
These  considerations  serve  to  some  extent  to  define  the 
sense in which the somewhat vague phrase, "strength
of heredity," must be interpreted, in speaking of the
correlation  coefficient.  It  will  readily  be  understood 
that,  in  less  well  understood  cases,  analogous  considera- 
tions may  be  of  some  importance,  and  should  if 
possible  be  critically  considered. 

33.  Accuracy  of  the  Correlation  Coefficient 

With large samples, and moderate or small correlations,
the correlation obtained from a sample of n
pairs of values is distributed normally about the true
value ρ, with variance

$$\frac{(1-\rho^2)^2}{n-1};$$

it is therefore usual to attach to an observed value r
a standard error $(1-r^2)/\sqrt{n-1}$, or $(1-r^2)/\sqrt{n}$. This
procedure is only valid under the restrictions stated
above; with small samples the value of r is often very
different from the true value, ρ, and the factor $1-r^2$
correspondingly in error; in addition the distribution
of r is far from normal, so that tests of significance
based on the above formula are often very deceptive.
Since it is with small samples, less than 100, that the
practical  research  worker  ordinarily  wishes  to  use  the 
correlation  coefficient,  we  shall  give  an  account  of  more 
accurate  methods  of  handling  the  results. 

In  all  cases  the  procedure  is  alike  for  total  and  for 
partial  correlations.  Exact  account  may  be  taken  of 
the  differences  in  the  distributions  in  the  two  cases, 
by  deducting  unity  from  the  sample  number  for  each 
variate  eliminated  ;    thus  a  partial  correlation  found 
by eliminating three variates, and based on data
giving  13  values  for  each  variate,  is  distributed  exactly 
as  is  a  total  correlation  based  on  10  pairs  of  values. 


34.  The  Significance  of  an  Observed  Correlation 

In  testing  the  significance  of  an  observed  correla- 
tion we  require  to  calculate  the  probability  that  such 
a  correlation  should  arise,  by  random  sampling,  from 
an  uncorrelated  population.  If  the  probability  is  low 
we regard the correlation as significant. The table
of t given at the end of the preceding chapter (p. 139)
may be utilised to make an exact test. If n' be the
number of pairs of observations on which the correlation
is based, and r the correlation obtained, without
using Sheppard's correction, then we take

$$t = \frac{r}{\sqrt{1-r^2}}\,\sqrt{n'-2}, \qquad n = n'-2,$$

and it may be demonstrated that the distribution
of t so calculated will agree with that given in the
table.

It  should  be  observed  that  this  test,  as  is  obviously 
necessary,  is  identical  with  that  given  in  the  last 
chapter  for  testing  whether  or  not  the  linear  regression 
coefficient  differs  significantly  from  zero. 

Table  V.A.  (p.  176)  allows  this  test  to  be  applied 
directly  from  the  value  of  r,  for  samples  up  to  100 
pairs of observations. Taking the four definite levels
of significance, represented by P = .10, .05, .02, and
.01, the table shows for each value of n, from 1 to 20,
and thence by larger intervals to 100, the corresponding
values of r.

Ex.  27.  Significance  of  a  correlation  coefficient 
between  autumn  rainfall  and  wheat  crop. — For  the 
twenty years, 1885-1904, the mean wheat yield of
Eastern England was found to be correlated with the
autumn rainfall; the correlation found was -.629.
Is this value significant? We obtain in succession

$$\begin{aligned}
1 - r^2 &= .6044, \\
\sqrt{1 - r^2} &= .7774, \\
r/\sqrt{1 - r^2} &= -.8091, \\
t &= -3.433.
\end{aligned}$$

For n = 18, this shows that P is less than .01, and the
correlation  is  definitely  significant.  The  same  con- 
clusion may  be  read  off  at  once  from  Table  V.A. 
entered  with  n  =  18. 
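
The arithmetic of Ex. 27 may be retraced by a short sketch such as the following. The function names are illustrative only and do not come from the text; the probability is found by numerical integration of the density of t (Simpson's rule), which is an assumption of convenience standing in for the printed table.

```python
import math

def t_from_r(r, n_pairs):
    """Return (t, n) for testing an observed correlation against zero."""
    n = n_pairs - 2                       # degrees of freedom, n = n' - 2
    t = r / math.sqrt(1.0 - r * r) * math.sqrt(n)
    return t, n

def t_two_tailed_p(t, n, steps=2000):
    """Two-tailed P for Student's t, by Simpson's rule on the density."""
    log_c = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
             - 0.5 * math.log(n * math.pi))

    def density(x):
        return math.exp(log_c - (n + 1) / 2 * math.log1p(x * x / n))

    h = abs(t) / steps                    # steps must be even
    total = density(0.0) + density(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = total * h / 3          # area between 0 and |t|
    return 1.0 - 2.0 * central_area

t, n = t_from_r(-0.629, 20)               # Ex. 27: twenty years of data
p = t_two_tailed_p(t, n)
print(round(t, 3), n, round(p, 4))
```

The value of P so found should fall below .01, in agreement with the conclusion drawn from the table.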

If we had applied the standard error,

$$\sigma_r = \frac{1-r^2}{\sqrt{n'-1}},$$

we should have

$$t = \frac{r\sqrt{n'-1}}{1-r^2} = -4.536,$$

a much greater value than the true one, very much
exaggerating the significance. In addition, assuming
that r was normally distributed (n = ∞), the significance
of the result would be even further exaggerated.
This illustration will suffice to show how deceptive, in
small samples, is the use of the standard error of the
correlation coefficient, on the assumption that it will
be normally distributed. Without this assumption
the standard error is without utility. The misleading
character of the formula is increased if n' is substituted
for n' - 1, as is often done. Judging from the normal
deviate 4.536, we should suppose that the correlation
obtained  would  be  exceeded  in  random  samples  from 
uncorrelated  material  only  6  times  in  a  million  trials. 
Actually  it  would  be  exceeded  about  3000  times  in 
a  million  trials,  or  with  500  times  the  frequency 
supposed. 
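
The misleading calculation may be reproduced as follows. This is a sketch only, with hypothetical function names; the normal probability is taken from the complementary error function of the standard library.

```python
import math

def naive_normal_deviate(r, n_pairs):
    """Divide r by the large-sample standard error (1 - r^2)/sqrt(n' - 1)."""
    standard_error = (1.0 - r * r) / math.sqrt(n_pairs - 1)
    return r / standard_error

def normal_two_tailed_p(deviate):
    """Probability that a normal deviate is exceeded in absolute value."""
    return math.erfc(abs(deviate) / math.sqrt(2.0))

deviate = naive_normal_deviate(-0.629, 20)
p_supposed = normal_two_tailed_p(deviate)
print(round(deviate, 3), p_supposed * 1e6)   # frequency per million trials
```

The supposed frequency comes out at a few parts in a million, against roughly three thousand in a million by the exact test.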

It  is  necessary  to  warn  the  student  emphatically 
against  the  misleading  character  of  the  standard  error 
of  the  correlation  coefficient  deduced  from  a  small 
sample,  because  the  principal  utility  of  the  correlation 
coefficient  lies  in  its  application  to  subjects  of  which 
little  is  known,  and  upon  which  the  data  are  rela- 
tively scanty.  With  extensive  material  appropriate 
for  biometrical  investigations  there  is  little  danger 
of  false  conclusions  being  drawn,  whereas  with  the 
comparatively  few  cases  to  which  the  experimenter 
must  often  look  for  guidance,  the  uncritical  applica- 
tion of  methods  standardised  in  biometry,  must  be  so 
frequently  misleading  as  to  endanger  the  credit  of  this 
most  valuable  weapon  of  research.  It  is  not  true,  as 
the  above  example  shows,  that  valid  conclusions  cannot 
be  drawn  from  small  samples  ;  if  accurate  methods 
are  used  in  calculating  the  probability,  we  thereby 
make  full  allowance  for  the  size  of  the  sample,  and 
should  be  influenced  in  our  judgment  only  by  the  value 
of the probability indicated. The great increase of
certainty which accrues from increasing data is
reflected  in  the  value  of  P,  if  accurate  methods  are 
used. 

Ex.  28.  Significance  of  a  partial  correlation 
coefficient. — In  a  group  of  32  poor  law  relief  unions, 
Yule  found  that  the  percentage  change  from  1881  to 
1891 in the percentage of the population in receipt of
relief  was  correlated  with  the  corresponding  change  in 
the  ratio  of  the  numbers  given  outdoor  relief  to  the 
numbers  relieved  in  the  workhouse,  when  two  other 
variates  had  been  eliminated,  namely,  the  correspond- 
ing changes  in  the  percentage  of  the  population  over 
65,  and  in  the  population  itself. 

The  correlation  found  by  Yule  after  eliminating 
the two variates was +.457; such a correlation is
termed  a  partial  correlation  of  the  second  order.  Test 
its  significance. 

It  has  been  demonstrated  that  the  distribution  in 
random  samples  of  partial  correlation  coefficients  may 
be  derived  from  that  of  total  correlation  coefficients 
merely  by  deducting  from  the  number  of  the  sample, 
the  number  of  variates  eliminated.  Deducting  2  from 
the  32  unions  used,  we  have  30  as  the  effective  number 
of  the  sample  ;   hence 

$$n = 28.$$

Calculating t from r as before, we find

$$t = 2.719,$$

whence it appears from the table that P lies between
.02 and .01. The correlation is therefore significant.
This, of course, as in other cases, is on the assumption
that the variates correlated (but not necessarily
those eliminated) are normally distributed; economic
variates seldom themselves give normal distributions,
but the fact that we are here dealing with rates of
change makes the assumption of normal distribution
much more plausible. The values given in Table V.A
for n = 25 and n = 30 give a sufficient indication of
the level of significance attained by this observation.
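
The rule of deducting one from the sample number for each variate eliminated may be sketched thus; the function name and its arguments are illustrative and not part of the text.

```python
import math

def partial_correlation_t(r, n_sample, n_eliminated):
    """Return (t, n) for a partial correlation.

    The effective sample number is n_sample - n_eliminated, and the
    degrees of freedom are two fewer than that.
    """
    n = n_sample - n_eliminated - 2
    t = r / math.sqrt(1.0 - r * r) * math.sqrt(n)
    return t, n

# Ex. 28: 32 unions, two variates eliminated, r = +.457
t, n = partial_correlation_t(0.457, 32, 2)
print(round(t, 3), n)
```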


35.  Transformed  Correlations 

In  addition  to  testing  the  significance  of  a  correla- 
tion, to  ascertain  if  there  is  any  substantial  evidence  of 
association  at  all,  it  is  also  frequently  required  to 
perform  one  or  more  of  the  following  operations,  for 
each  of  which  the  standard  error  would  be  used  in  the 
case  of  a  normally  distributed  quantity.  With  cor- 
relations derived  from  large  samples  the  standard 
error  may,  therefore,  be  so  used,  except  when  the 
correlation  approaches  ±  1  ;  but  with  small  samples 
such  as  frequently  occur  in  practice,  special  methods 
must  be  applied  to  obtain  reliable  results. 

(i.) To test if an observed correlation differs significantly from a given theoretical value.
(ii.) To test if two observed correlations are significantly different.
(iii.) If a number of independent estimates of a correlation are available, to combine them into an improved estimate.
(iv.) To perform tests (i.) and (ii.) with such average values.

Problems of these kinds may be solved by a method
analogous  to  that  by  which  we  have  solved  the  problem 
of  testing  the  significance  of  an  observed  correlation. 
In  that  case  we  were  able  from  the  given  value  r  to 
calculate  a  quantity  t  which  is  distributed  in  a  known 
manner,  for  which  tables  were  available.  The  trans- 
formation led  exactly  to  a  distribution  which  had 
already  been  studied.  The  transformation  which  we 
shall  now  employ  leads  approximately  to  the  normal 
distribution  in  which  all  the  above  tests  may  be  carried 
out without difficulty. Let

$$z = \tfrac{1}{2}\{\log_e(1+r) - \log_e(1-r)\};$$

then as r changes from 0 to 1, z will pass from 0 to ∞.
For small values of r, z is nearly equal to r, but as
r  approaches  unity,  z  increases  without  limit.  For 
negative  values  of  r,  z  is  negative.  The  advantage  of 
this  transformation  lies  in  the  distribution  of  the  two 
quantities in random samples. The standard deviation
of r depends on the true value of the correlation,
ρ, as is seen from the formula

$$\sigma_r = \frac{1-\rho^2}{\sqrt{n'-1}}.$$

Since ρ is unknown, we have to substitute for it the
observed value r, and this value will not, in small
samples, be a very accurate estimate of ρ. The
standard error of z is simpler in form,

$$\sigma_z = \frac{1}{\sqrt{n'-3}},$$

and is practically independent of the value of the
correlation in the population from which the sample is
drawn.
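
A brief sketch of the transformation and of its standard error follows, assuming the definition of z as half the natural logarithm of (1 + r)/(1 - r); this is the same quantity as the inverse hyperbolic tangent supplied by the standard library. The function names are illustrative.

```python
import math

def fisher_z(r):
    """z = (1/2){log(1 + r) - log(1 - r)}, natural logarithms."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))

def z_standard_error(n_pairs):
    """Standard error of z for a sample of n' pairs: 1/sqrt(n' - 3)."""
    return 1.0 / math.sqrt(n_pairs - 3)

def r_standard_error(rho, n_pairs):
    """Large-sample standard deviation of r, which depends on rho."""
    return (1.0 - rho * rho) / math.sqrt(n_pairs - 1)

for rho in (0.0, 0.5, 0.8):
    # the error of r varies with rho; that of z does not
    print(rho, round(r_standard_error(rho, 20), 4), round(z_standard_error(20), 4))
```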

In  the  second  place  the  distribution  of  r  is  skew  in 
small  samples,  and  even  for  large  samples  it  remains 
very  skew  for  high  correlations.  The  distribution  of 
z  is  not  strictly  normal,  but  it  tends  to  normality  rapidly 
as  the  sample  is  increased,  whatever  may  be  the  value 
of  the  correlation.  We  shall  give  examples  to  test 
the  effect  of  the  departure  of  the  z  distribution  from 
normality. 

Finally  the  distribution  of  r  changes  its  form 
rapidly as ρ is changed; consequently no attempt can
be  made,  with  reasonable  hope  of  success,  to  allow  for 
the  skewness  of  the  distribution.  On  the  contrary,  the 
distribution  of  z  is  nearly  constant  in  form,  and  the 
accuracy  of  tests  may  be  improved  by  small  correc- 
tions for  skewness  ;  such  corrections  are,  however,  in 
any  case  somewhat  laborious,  and  we  shall  not  deal 
with  them.  The  simple  assumption  that  z  is  normally 
distributed  will  in  all  ordinary  cases  be  sufficiently 
accurate. 

These  three  advantages  of  the  transformation  from 
r  to  z  may  be  seen  by  comparing  Figs.  7  and  8.  In 
Fig.  7  are  shown  the  actual  distributions  of  r,  for  8 
pairs of observations, from populations having correlations
0 and 0.8; Fig. 8 shows the corresponding
distribution  curves  for  z.  The  two  curves  in  Fig.  7 
are  widely  different  in  their  modal  heights  ;  both  are 
distinctly  non-normal  curves  ;  in  form  also  they  are 
strongly  contrasted,  the  one  being  symmetrical,  the 
other    highly    unsymmetrical.     On    the    contrary,    in 
Fig. 8 the two curves do not differ greatly in height;
although not exactly normal in form, they come so
close to it, even for a small sample of 8 pairs of observations,
that the eye cannot detect the difference; and
this approximate normality holds up to the extreme
limits ρ = ±1. One additional feature is brought out
by Fig. 8; in the distribution for ρ = 0.8, although the
curve itself is as symmetrical as the eye can judge of,
yet the ordinate of zero error is not centrally placed.
The figure, in fact, reveals the small bias which is
introduced into the estimate of the correlation coefficient
as ordinarily calculated; we shall treat
further of this bias in the next section, and in the
following chapter shall deal with a similar bias introduced
in the calculation of intraclass correlations.

[Fig. 7. Distributions of r in samples of 8 pairs, for ρ = 0 and ρ = 0.8; horizontal axis: value of r observed.]

[Fig. 8. Corresponding distributions of z.]

To  facilitate  the  transformation  we  give  in  Table 
V.B  (p.  177)  the  values  of  r  corresponding  to  values  of 
z, proceeding by intervals of .01, from 0 to 3. In the
earlier part of this table it will be seen that the values
of r and z do not differ greatly; but with higher correlations
small changes in r correspond to relatively
large changes in z. In fact, measured on the z-scale,
a correlation of .99 differs from a correlation .95
by more than a correlation .6 exceeds zero. The
values  of  z  give  a  truer  picture  of  the  relative  import- 
ance of  correlations  of  different  sizes,  than  do  the 
values  of  r. 

To  find  the  value  of  z  corresponding  to  a  given 
value of r, say .6, the entries in the table lying on
either side of .6 are first found, whence we see at once
that z lies between .69 and .70; the interval between
these entries is then divided proportionately to find the
fraction to be added to .69. In this case we have
20/64, or .31, so that z = .6931. Similarly, in finding
the value of r corresponding to any value of z, say
.9218, we see at once that it lies between .7259 and
.7306; the difference is 47, and 18 per cent of this
gives 8 to be added to the former value, giving us
finally r = .7267. The same table may thus be used
to  transform  r  into  z,  and  to  reverse  the  process. 
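
The proportional division described above may be imitated in a few lines. In this sketch the table is regenerated from the hyperbolic tangent and rounded to four places, on the assumption that this reproduces the printed entries; the names `TABLE_R`, `z_from_r` and `r_from_z` are hypothetical.

```python
import math

STEP = 0.01
# r = tanh(z) to four places, for z = 0.00, 0.01, ..., 3.00
TABLE_R = [round(math.tanh(i * STEP), 4) for i in range(301)]

def z_from_r(r):
    """Find z for a given r by proportional parts between table entries."""
    sign = -1.0 if r < 0 else 1.0
    target = abs(r)
    for i in range(300):
        low, high = TABLE_R[i], TABLE_R[i + 1]
        if low <= target <= high:
            fraction = 0.0 if high == low else (target - low) / (high - low)
            return sign * (i + fraction) * STEP
    raise ValueError("r lies beyond the range of the table")

def r_from_z(z):
    """Find r for a given z by proportional parts between table entries."""
    sign = -1.0 if z < 0 else 1.0
    position = abs(z) / STEP
    i = int(position)
    if i >= 300:
        if position > 300:
            raise ValueError("z lies beyond the range of the table")
        return sign * TABLE_R[300]
    fraction = position - i
    return sign * (TABLE_R[i] + fraction * (TABLE_R[i + 1] - TABLE_R[i]))

print(round(z_from_r(0.6), 4), round(r_from_z(0.9218), 4))
```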

Ex.  29.  Test  of  the  approximate  normality  of  the 
distribution  of  z. — In  order  to  illustrate  the  kind  of 
accuracy  obtainable  by  the  use  of  z,  let  us  take  the 
case  that  has  already  been  treated  by  an  exact  method 
in Ex. 27. A correlation of -.629 has been obtained
from 20 pairs of observations; test its significance.

For r = -.629 we have, using either a table of
natural logarithms, or the special table for z, z = -.7398.
To divide this by its standard error is equivalent to
multiplying it by $\sqrt{17}$. This gives -3.050, which we
interpret  as  a  normal  deviate.  From  the  table  of 
normal  deviates  it  appears  that  this  value  will  be 
exceeded  about  23  times  in  10,000  trials.  The  true 
frequency,  as  we  have  seen,  is  about  30  times  in 
10,000  trials.  The  error  tends  only  slightly  to 
exaggerate  the  significance  of  the  result. 
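
The approximate test of Ex. 29 may be sketched as below, taking z as normally distributed with standard error $1/\sqrt{n'-3}$. The function name is illustrative, and the use of `math.atanh` for the transformation is an assumption of convenience.

```python
import math

def z_test_against_zero(r, n_pairs):
    """Return (z, normal deviate, two-tailed P) for the hypothesis rho = 0."""
    z = math.atanh(r)
    deviate = z * math.sqrt(n_pairs - 3)     # z divided by its standard error
    p = math.erfc(abs(deviate) / math.sqrt(2.0))
    return z, deviate, p

z, deviate, p = z_test_against_zero(-0.629, 20)
print(round(z, 4), round(deviate, 3), round(p * 10000, 1))  # P per 10,000 trials
```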

Ex.  30.  Further  test  of  the  normality  of  the 
distribution of z. A partial correlation +.457 was
obtained from a sample of 32, after eliminating two
variates. Does this differ significantly from zero?
Here z = .4935; deducting the two eliminated variates
the effective size of the sample is 30, and the standard
error of z is $1/\sqrt{27}$; multiplying z by $\sqrt{27}$, we have as
a normal variate 2.564. Table I. (or the bottom line
of Table IV.) shows, as before, that P is just over .01.
There is a slight exaggeration of significance, but it is
even slighter than in the previous example.

The  above  examples  show  that  the  z  transforma- 
tion will  give  a  variate  which,  for  most  practical  pur- 
poses, may  be  taken  to  be  normally  distributed.  In 
the  case  of  simple  tests  of  significance  the  use  of  the 
table of t is to be preferred; in the following examples
this  method  is  not  available,  and  the  only  method 
available  which  is  both  tolerably  accurate  and  suffi- 
ciently rapid  for  practical  use  lies  in  the  use  of  z. 

Ex.  31.  Significance  of  deviation  from  expecta- 
tion of  an  observed  correlation  coefficient.  —  In  a 
sample  of  25  pairs  of  parent  and  child  the  correlation 
was found to be .60. Is this value consistent with the
view that the true correlation in that character was
.46?

The  first  step  is  to  find  the  difference  of  the  corre- 
sponding values  of  z.     This  is  shown  below  : 

TABLE 33

                        r.        z.
Sample value           .60      .6931
Population value       .46      .4973
Difference                      .1958

To obtain the normal deviate we multiply by $\sqrt{22}$,
and obtain .918. The deviation is less than the
standard  deviation,  and  the  value  obtained  is  therefore 
quite  in  accordance  with  the  hypothesis. 
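
The comparison of an observed correlation with a theoretical value, as in Ex. 31, reduces to a difference of two values of z. A minimal sketch, with an illustrative function name:

```python
import math

def z_deviate_from_theory(r_observed, rho_theory, n_pairs):
    """Normal deviate for the difference between observed and theoretical z."""
    difference = math.atanh(r_observed) - math.atanh(rho_theory)
    return difference * math.sqrt(n_pairs - 3)

# Ex. 31: r = .60 from 25 pairs, theoretical value .46
deviate = z_deviate_from_theory(0.60, 0.46, 25)
print(round(deviate, 3))
```

A deviate numerically less than unity, as here, gives no ground for doubting the hypothesis.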
Ex. 32. Significance of difference between two
observed  correlations.  —  Of  two  samples  the  first, 
of 20 pairs, gives a correlation .6, the second, of 25
pairs, gives a correlation .8: are these values significantly
different?

In  this  case  we  require  not  only  the  difference  of 
the  values  of  z,  but  the  standard  error  of  the  difference. 
The  variance  of  the  difference  is  the  sum  of  the 
reciprocals  of  17  and  22  ;  the  work  is  shown  below : 


TABLE 34

                  r.        z.       n' - 3.    Reciprocal.
1st sample       .60      .6931        17         .05882
2nd sample       .80     1.0986        22         .04545
Difference               .4055 ± .3230     Sum    .10427

The  standard  error  which  is  appended  to  the 
difference  of  the  values  of  z  is  the  square  root  of  the 
variance  found  on  the  same  line.  The  difference  does 
not  exceed  twice  the  standard  error,  and  cannot  there- 
fore be  judged  significant.  There  is  thus  no  sufficient 
evidence  to  conclude  that  the  two  samples  are  not 
drawn  from  equally  correlated  populations. 
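
The working of Ex. 32 may be set out as a sketch, in which the variance of the difference is the sum of the reciprocals of n' - 3 for the two samples. The function name is hypothetical.

```python
import math

def compare_two_correlations(r1, n1, r2, n2):
    """Return (difference of z, its standard error, their ratio)."""
    difference = math.atanh(r2) - math.atanh(r1)
    variance = 1.0 / (n1 - 3) + 1.0 / (n2 - 3)
    standard_error = math.sqrt(variance)
    return difference, standard_error, difference / standard_error

difference, standard_error, ratio = compare_two_correlations(0.6, 20, 0.8, 25)
print(round(difference, 4), round(standard_error, 4), round(ratio, 2))
```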

Ex.  33.  Combination  of  values  from  small  samples. 
— Assuming  that  the  two  samples  in  the  last  example 
were  drawn  from  equally  correlated  populations, 
estimate  the  value  of  the  correlation. 

The  two  values  of  z  must  be  given  weight  in- 
versely proportional  to  their  variance.     We  therefore 
multiply the first by 17, the second by 22 and add,
dividing  the  total  by  39.  This  gives  an  estimated 
value  of  z  for  the  population,  and  the  corresponding 
value  of  r  may  be  found  from  the  table. 

TABLE 35

                  r.        z.       n' - 3.    (n' - 3)z.
1st sample       .60      .6930        17        11.7810
2nd sample       .80     1.0986        22        24.1692
                .7267     .9218        39        35.9502

The weighted average value of z is .9218, to which
corresponds the value r = .7267; the value of z so
obtained  may  be  regarded  as  subject  to  normally 
distributed  errors  of  random  sampling  with  variance 
equal  to  1/39.  The  accuracy  is  therefore  equivalent 
to  that  of  a  single  value  obtained  from  42  pairs  of 
observations.  Tests  of  significance  may  thus  be 
applied  to  such  averaged  values  of  z,  as  to  individual 
values. 
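
The weighting of Ex. 33 may be sketched as follows, each z receiving the weight n' - 3. The function name and the form of its argument (a list of pairs, each giving r and the number of observations) are illustrative choices.

```python
import math

def combine_correlations(samples):
    """Combine (r, n_pairs) estimates; return (mean z, r, variance of mean z)."""
    weights = [n_pairs - 3 for _, n_pairs in samples]
    z_values = [math.atanh(r) for r, _ in samples]
    total_weight = sum(weights)
    z_mean = sum(w * z for w, z in zip(weights, z_values)) / total_weight
    return z_mean, math.tanh(z_mean), 1.0 / total_weight

z_mean, r_combined, variance = combine_correlations([(0.6, 20), (0.8, 25)])
print(round(z_mean, 4), round(r_combined, 4), variance)
```

Small differences in the last decimal place from the printed table are to be expected, since the table works from values of z rounded to four places.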

36.  Systematic  Errors 

In  connexion  with  the  averaging  of  correlations 
obtained  from  small  samples  it  is  worth  while  to 
consider  the  effects  of  two  classes  of  systematic  errors, 
which,  although  of  little  or  no  importance  when  single 
values  only  are  available,  become  of  increasing  im- 
portance as  larger  numbers  of  samples  are  averaged. 

The value of z obtained from any sample is an
estimate of a true value, ζ, belonging to the sampled
population, just as the value of r obtained from a
sample is an estimate of a population value, ρ. If the
method of obtaining the correlation were free from
bias, the values of z would be normally distributed
about a mean z̄, which would agree in value with ζ.
Actually there is a small bias which makes the mean
value of z somewhat greater numerically than ζ;
thus the correlation, whether positive or negative, is
slightly exaggerated. This bias may effectively be
corrected by subtracting from the value of z the
correction

$$\frac{\rho}{2(n'-1)}.$$

For single samples this correction is unimportant,
being small compared to the standard error of z. For
example, if n' = 10, the standard error of z is .378,
while the correction is ρ/18 and cannot exceed .056.
If, however, z̄ were the mean of 1000 such values of z,
derived from samples of 10, the standard error of z̄
is only .012, and the correction, which is unaltered by
taking  the  mean,  may  become  of  great  importance. 
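
The relative size of the bias and of the random error may be seen from a short sketch; the function names are illustrative, and the standard error of a mean of several values of z is taken, as an assumption, to be that of a single value divided by the square root of their number.

```python
import math

def z_bias_correction(rho, n_pairs):
    """Correction rho / (2(n' - 1)) to be subtracted from z."""
    return rho / (2.0 * (n_pairs - 1))

def z_standard_error(n_pairs, n_samples=1):
    """Standard error of the mean of n_samples values of z, each from n' pairs."""
    return 1.0 / math.sqrt((n_pairs - 3) * n_samples)

largest_correction = z_bias_correction(1.0, 10)   # rho cannot exceed unity
single_error = z_standard_error(10)
mean_error = z_standard_error(10, n_samples=1000)
print(round(largest_correction, 3), round(single_error, 3), round(mean_error, 3))
```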

The  second  type  of  systematic  error  is  that  intro- 
duced by  neglecting  Sheppard's  correction.  In  calcu- 
lating the  value  of  z,  we  must  always  take  the  value  of 
r  found  without  using  Sheppard's  correction,  since  the 
latter  complicates  the  distribution. 

But  the  omission  of  Sheppard's  correction  intro- 
duces a  systematic  error,  in  the  opposite  direction  to 
that  mentioned  above  ;  and  which,  though  normally 
very  small,  appears  in  large  as  well  as  in  small  samples. 
In  the  case  of  averaging  the  correlations  from  a  number 
of coarsely grouped small samples, the average z should
be  obtained  from  values  of  r  found  without  Sheppard's 
correction,  and  to  the  result  a  correction,  representing 
the  average  effect  of  Sheppard's  correction,  may  be 
applied. 

37.  Correlation  between  Series 

The  extremely  useful  case  in  which  it  is  required  to 
find  the  correlation  between  two  series  of  quantities, 
such  as  annual  figures,  arranged  in  order  at  equal 
intervals  of  time,  is  in  reality  a  case  of  partial  correla- 
tion, although  it  may  be  treated  more  directly  by  the 
method  of  fitting  curved  regression  lines  given  in  the 
last  chapter  (p.  124). 

If,  for  example,  we  had  a  record  of  the  number 
of  deaths  from  a  certain  disease  for  successive  years, 
and  wished  to  study  if  this  mortality  were  associated 
with  meteorological  conditions,  or  the  incidence  of 
some  other  disease,  or  the  mortality  of  some  other  age 
group,  the  outstanding  difficulty  in  the  direct  applica- 
tion of  the  correlation  coefficient  is  that  the  number 
of  deaths  considered  probably  exhibits  a  progressive 
change  during  the  period  available.  Such  changes 
may  be  due  to  changes  in  the  population  among  which 
the  deaths  occur,  whether  it  be  the  total  population 
of  a  district,  or  that  of  a  particular  age  group,  or 
to  changes  in  the  sanitary  conditions  in  which  the 
population  lives,  or  in  the  skill  and  availability  of 
medical  assistance,  or  to  changes  in  the  racial  or 
genetic  composition  of  the  population.  In  any  case 
it  is  usually  found  that  the  changes  are  still  apparent 
when the number of deaths is converted into a death-
rate  on  the  existing  population  in  each  year,  by  which 
means  one  of  the  direct  effects  of  changing  population 
is  eliminated. 

If  the  progressive  change  could  be  represented 
effectively  by  a  straight  line  it  would  be  sufficient  to 
consider  the  time  as  a  third  variate,  and  to  eliminate 
it  by  calculating  the  corresponding  partial  correlation 
coefficient.  Usually,  however,  the  change  is  not  so 
simple,  and  would  need  an  expression  involving  the 
square  and  higher  powers  of  the  time  adequately  to 
represent  it.  The  partial  correlation  required  is  one 
found by eliminating not only t, but $t^2$, $t^3$, $t^4$, . . .,
regarding  these  as  separate  variates  ;  for  if  we  have 
eliminated  all  of  these  up  to  (say)  the  fourth  degree, 
we  have  incidentally  eliminated  from  the  correlation 
any  function  of  the  time  of  the  fourth  degree,  including 
that  by  which  the  progressive  change  is  best  repre- 
sented. 

This  partial  correlation  may  be  calculated  directly 
from  the  coefficients  of  the  regression  function  obtained 
as in the last chapter (p. 126). If y and y' are the two
quantities to be correlated, we obtain for y the coefficients
A, B, C, . . ., and for y' the corresponding
coefficients A', B', C', . . .; the sum of the squares of
the deviations of the variates from the curved regression
lines are obtained as before, from the equations

$$S(y - Y)^2 = S(y^2) - n'A^2 - \frac{n'(n'^2-1)}{12}B^2 - \ldots,$$

$$S(y' - Y')^2 = S(y'^2) - n'A'^2 - \frac{n'(n'^2-1)}{12}B'^2 - \ldots;$$

while the sum of the products may be obtained from
the similar equation

$$S\{(y - Y)(y' - Y')\} = S(yy') - n'AA' - \frac{n'(n'^2-1)}{12}BB' - \ldots,$$

the required partial correlation being, then,

$$r = \frac{S\{(y - Y)(y' - Y')\}}{\sqrt{S(y - Y)^2 \cdot S(y' - Y')^2}}.$$

In  this  process  the  number  of  variates  eliminated 
is  equal  to  the  degree  of  t  to  which  the  fitting  has  been 
carried  ;  it  will  be  understood  that  both  variates  must 
be  fitted  to  the  same  degree,  even  if  one  of  them  is 
capable  of  adequate  representation  by  a  curve  of  lower 
degree  than  is  the  other. 
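
The process may be sketched numerically. The following does not use the tabular coefficients A, B, C of the last chapter; instead, as an assumed equivalent, it forms orthogonal polynomials in the time by successive orthogonalisation of its powers, removes the fitted curve from each series, and correlates the residuals. All names, and the two short series, are invented for illustration.

```python
import math

def polynomial_residuals(values, degree):
    """Deviations of a series from its fitted polynomial in time."""
    n = len(values)
    time = [i - (n - 1) / 2 for i in range(n)]     # equal intervals, centred
    basis = []                                     # orthogonal polynomials
    for power in range(degree + 1):
        column = [t ** power for t in time]
        for b in basis:                            # remove lower-degree parts
            k = sum(c * x for c, x in zip(column, b)) / sum(x * x for x in b)
            column = [c - k * x for c, x in zip(column, b)]
        basis.append(column)
    residuals = list(values)
    for b in basis:                                # subtract each fitted term
        k = sum(v * x for v, x in zip(values, b)) / sum(x * x for x in b)
        residuals = [r - k * x for r, x in zip(residuals, b)]
    return residuals

def detrended_correlation(y, y_prime, degree):
    """Return (partial correlation, degrees of freedom n) after fitting."""
    dy = polynomial_residuals(y, degree)
    dy_prime = polynomial_residuals(y_prime, degree)
    products = sum(a * b for a, b in zip(dy, dy_prime))
    squares = sum(a * a for a in dy) * sum(b * b for b in dy_prime)
    n = len(y) - 2 - degree                        # one variate per degree of t
    return products / math.sqrt(squares), n

# Invented series: a shared disturbance upon two different trends
disturbance = [0.3, -1.2, 0.8, 0.1, -0.5, 1.4, -0.9, 0.2]
series_a = [i * i + e for i, e in enumerate(disturbance)]
series_b = [5 - i + 2 * e for i, e in enumerate(disturbance)]
r, n = detrended_correlation(series_a, series_b, degree=2)
print(round(r, 6), n)
```

Both series are fitted to the second degree, although the second would be adequately represented by a straight line, in accordance with the rule just stated.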
TABLE V.A. Values of the Correlation Coefficient
for Different Levels of Significance

  n     P = .10      .05         .02          .01
  1     .98769     .996917    .9995066     .9998766
  2     .90000     .95000     .98000       .990000
  3     .8054      .8783      .93433       .95873
  4     .7293      .8114      .8822        .91720
  5     .6694      .7545      .8329        .8745
  6     .6215      .7067      .7887        .8343
  7     .5822      .6664      .7498        .7977
  8     .5494      .6319      .7155        .7646
  9     .5214      .6021      .6851        .7348
 10     .4973      .5760      .6581        .7079
 11     .4762      .5529      .6339        .6835
 12     .4575      .5324      .6120        .6614
 13     .4409      .5139      .5923        .6411
 14     .4259      .4973      .5742        .6226
 15     .4124      .4821      .5577        .6055
 16     .4000      .4683      .5425        .5897
 17     .3887      .4555      .5285        .5751
 18     .3783      .4438      .5155        .5614
 19     .3687      .4329      .5034        .5487
 20     .3598      .4227      .4921        .5368
 25     .3233      .3809      .4451        .4869
 30     .2960      .3494      .4093        .4487
 35     .2746      .3246      .3810        .4182
 40     .2573      .3044      .3578        .3932
 45     .2428      .2875      .3384        .3721
 50     .2306      .2732      .3218        .3541
 60     .2108      .2500      .2948        .3248
 70     .1954      .2319      .2737        .3017
 80     .1829      .2172      .2565        .2830
 90     .1726      .2050      .2422        .2673
100     .1638      .1946      .2301        .2540

For  a  total  correlation,  n  is  2  less  than  the  number  of  pairs  in  the 
sample  ;  for  a  partial  correlation,  the  number  of  eliminated  variates  also 
should  be  subtracted. 
INTRACLASS CORRELATIONS AND THE
ANALYSIS OF VARIANCE


38.  A  type  of  data,  which  is  of  very  common  occur- 
rence, may  be  treated  by  methods  closely  analogous 
to  that  of  the  correlation  table,  while  at  the  same  time 
it  may  be  more  usefully  and  accurately  treated  by  the 
analysis  of  variance,  that  is  by  the  separation  of  the 
variance  ascribable  to  one  group  of  causes,  from  the 
variance  ascribable  to  other  groups.  We  shall  in  this 
chapter  treat  first  of  those  cases,  arising  in  biometry,  in 
which  the  analogy  with  the  correlations  treated  in  the 
last  chapter  may  most  usefully  be  indicated,  and  then 
pass  to  more  general  cases,  prevalent  in  experimental 
results,  in  which  the  treatment  by  correlation  appears 
artificial,  and  in  which  the  analysis  of  variance  appears 
to  throw  a  real  light  on  the  problems  before  us.  A 
comparison  of  the  two  methods  of  treatment  illustrates 
the  general  principle,  so  often  lost  sight  of,  that  tests  of 
significance,  in  so  far  as  they  are  accurately  carried 
out,  are  bound  to  agree,  whatever  process  of  statistical 
reduction  may  be  employed. 

If we have measurements of n' pairs of brothers,
we  may  ascertain  the  correlation  between  brothers  in 
two  slightly  different  ways.  In  the  first  place  we  may 
divide  the  brothers  into  two  classes,  as  for  instance 
elder  brother  and  younger  brother,  and  find  the  corre- 
lation between  these  two  classes  exactly  as  we  do  with 
parent  and  child.  If  we  proceed  in  this  manner  we 
shall  find  the  mean  of  the  measurements  of  the  elder 
brothers,  and  separately  that  of  the  younger  brothers. 
Equally  the  standard  deviations  about  the  mean  are 
found  separately  for  the  two  classes.  The  correlation 
so  obtained,  being  that  between  two  classes  of  measure- 
ments, is  termed  for  distinctness  an  interclass  correla- 
tion. Such  a  procedure  would  be  imperative  if  the 
quantities  to  be  correlated  were,  for  example,  the 
ages,  or  some  characteristic  sensibly  dependent  upon 
age,  at  a  fixed  date.  On  the  other  hand,  we  may  not 
know,  in  each  case,  which  measurement  belongs  to  the 
elder  and  which  to  the  younger  brother,  or,  such  a 
distinction  may  be  quite  irrelevant  to  our  purpose  ;  in 
such  cases  it  is  usual  to  use  a  common  mean  derived 
from  all  the  measurements,  and  a  common  standard 
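deviation about that mean. If $x_1, x'_1$ ; $x_2, x'_2$ ; . . . ;
$x_{n'}, x'_{n'}$ are the pairs of measurements given, we
calculate

$$\bar{x} = \frac{1}{2n'} S(x + x'),$$
$$s^2 = \frac{1}{2n'} \{S(x - \bar{x})^2 + S(x' - \bar{x})^2\},$$
$$r = \frac{1}{n's^2} S\{(x - \bar{x})(x' - \bar{x})\}.$$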
When this is done, r is distinguished as an intraclass
correlation, since we have treated all the brothers
as  belonging  to  the  same  class,  and  having  the  same 
mean  and  standard  deviation.  The  intraclass  correla- 
tion, when  its  use  is  justified  by  the  irrelevance  of  any 
such  distinction  as  age,  may  be  expected  to  give  a  more 
accurate  estimate  of  the  true  value  than  does  any  of 
the  possible  interclass  correlations  derived  from  the 
same  material,  for  we  have  used  estimates  of  the  mean 
and  standard  deviation  founded  on  2n'  instead  of 
on  n'  values.  This  is  in  fact  found  to  be  the  case  ; 
the  intraclass  correlation  is  not  an  estimate  equivalent 
to  an  interclass  correlation,  but  is  somewhat  more 
accurate.  The  error  distribution  is,  however,  as  we 
shall  see,  affected  also  in  other  ways,  which  require  the 
intraclass  correlation  to  be  treated  separately. 

The  analogy  of  this  treatment  with  that  of  inter- 
class correlations  may  be  further  illustrated  by  the 
construction  of  what  is  called  a  symmetrical  table. 
Instead  of  entering  each  pair  of  observations  once 
in  such  a  correlation  table,  it  is  entered  twice,  the 
co-ordinates of the two entries being, for instance,
$(x_1, x'_1)$ and $(x'_1, x_1)$. The total entries in the table
will then be 2n', and the two marginal distributions
will be identical, each representing the distribution of
the whole 2n' observations. The above equations,
for calculating the intraclass correlation, bear the same
relation to the symmetrical table as the equations for
the interclass correlation bear to the corresponding
unsymmetrical table with n' entries. Although the
intraclass correlation is somewhat the more accurate,
it is by no means so accurate as is an interclass correlation
with 2n' independent pairs of observations.
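The equivalence between the intraclass formulae for pairs and the ordinary product-moment correlation of a symmetrical table can be checked on a small example. The sketch below is illustrative only: the five pairs of measurements are invented, and the function names are assumptions.

```python
def intraclass_pairs(pairs):
    """Intraclass r for pairs, using a common mean and a common variance."""
    n = len(pairs)
    mean = sum(x + y for x, y in pairs) / (2 * n)
    s2 = sum((x - mean) ** 2 + (y - mean) ** 2 for x, y in pairs) / (2 * n)
    return sum((x - mean) * (y - mean) for x, y in pairs) / (n * s2)


def pearson(xs, ys):
    """Ordinary (interclass) product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


# Invented measurements of five pairs of brothers.
pairs = [(71.0, 69.5), (68.2, 70.1), (66.0, 65.4), (73.3, 71.8), (69.9, 67.0)]

# Symmetrical table: every pair is entered twice, once in each order.
sym_x = [x for x, y in pairs] + [y for x, y in pairs]
sym_y = [y for x, y in pairs] + [x for x, y in pairs]

r_intra = intraclass_pairs(pairs)
r_sym = pearson(sym_x, sym_y)
```

The two values agree apart from rounding, since the doubled entries give both margins the same mean and the same standard deviation.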

The  contrast  between  the  two  types  of  correlation 
becomes  more  obvious  when  we  have  to  deal  not  with 
pairs,  but  with  sets  of  three  or  more  measurements  ; 
for  example,  if  three  brothers  in  each  family  have  been 
measured.  In  such  cases  also  a  symmetrical  table 
can  be  constructed.  Each  trio  of  brothers  will  pro- 
vide three  pairs,  each  of  which  gives  two  entries,  so 
that  each  trio  provides  6  entries  in  the  table.  To 
calculate  the  correlation  from  such  a  table  is  equivalent 
to  the  following  equations  : 

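$$\bar{x} = \frac{1}{3n'} S(x + x' + x''),$$
$$s^2 = \frac{1}{3n'} S\{(x - \bar{x})^2 + (x' - \bar{x})^2 + (x'' - \bar{x})^2\},$$
$$r = \frac{1}{3n's^2} S\{(x' - \bar{x})(x'' - \bar{x}) + (x'' - \bar{x})(x - \bar{x}) + (x - \bar{x})(x' - \bar{x})\}.$$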

In  many  instances  of  the  use  of  intraclass  cor- 
relations the  number  of  observations  in  the  same 
"  fraternity  "  or  class,  is  large,  as  when  the  resemblance 
between  leaves  on  the  same  tree  is  studied  by  picking 
26  leaves  from  a  number  of  different  trees,  or  when 
100  pods  are  taken  from  each  tree  in  another  group 
of  correlation  studies.  If  k  is  the  number  in  each  class, 
then each set of k values will provide k(k - 1) values
for the symmetrical table, which thus may contain an
enormous number of entries, and be very laborious to
construct. To obviate this difficulty Harris introduced
an abbreviated method of calculation by which the
value of the correlation given by the symmetrical table
may be obtained directly from two distributions :
(i.) the distribution of the whole group of kn' observations,
from which we obtain, as above, the values of
$\bar{x}$ and s ; (ii.) the distribution of the n' means of classes.
If $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{n'}$ represent these means, each derived
from k values, then

$$k\,S(\bar{x}_p - \bar{x})^2 = n's^2\{1 + (k - 1)r\}$$

is  an  equation  from  which  can  be  calculated  the  value 
of  r,  the  intraclass  correlation  derived  from  the  sym- 
metrical table.  It  is  instructive  to  verify  this  fact, 
for  the  case  k  =3,  by  deriving  from  it  the  full  formula 
for  r  given  above  for  that  case. 

One  salient  fact  appears  from  the  above  relation  ; 
for  the  sum  of  a  number  of  squares,  and  therefore 
the  left  hand  of  this  equation,  is  necessarily  positive. 
Consequently r cannot have a negative value less than
-1/(k - 1). There is no such limitation to positive
values, all values up to +1 being possible. Further,
if k, the number in any class, is not necessarily less
than some fixed value, the correlation in the population
cannot be negative at all. For example, in card games,
where the number of suits is limited to four, the correlation
between the number of cards in different suits
in the same hand may have negative values down to
-1/3 ; but there is probably nothing in the production of
a leaf or a child which necessitates that the number in
such a class should be less than any number however
great, and in the absence of such a necessary restriction
we cannot expect to find negative correlations
within such classes. This is in the sharpest contrast
to  the  unrestricted  occurrence  of  negative  values 
among  interclass  correlations,  and  it  is  obvious,  since 
the  extreme  limits  of  variation  are  different  in  the  two 
cases,  that  the  distribution  of  values  in  random  samples 
must  be  correspondingly  modified. 
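Both Harris's abbreviated method and the lower limit -1/(k - 1) may be verified on small invented data. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the figures are made up, and every class is taken to contain the same number k of values.

```python
def symmetrical_table_r(classes):
    """Intraclass r from all k(k - 1) ordered pairs within each class."""
    n, k = len(classes), len(classes[0])
    values = [v for c in classes for v in c]
    mean = sum(values) / len(values)
    s2 = sum((v - mean) ** 2 for v in values) / len(values)
    prod = sum((c[i] - mean) * (c[j] - mean)
               for c in classes for i in range(k) for j in range(k) if i != j)
    return prod / (n * k * (k - 1) * s2)


def harris_r(classes):
    """The same r, solved from k S(class mean - mean)^2 = n' s^2 {1 + (k - 1) r}."""
    n, k = len(classes), len(classes[0])
    values = [v for c in classes for v in c]
    mean = sum(values) / len(values)
    s2 = sum((v - mean) ** 2 for v in values) / len(values)
    lhs = k * sum((sum(c) / k - mean) ** 2 for c in classes)
    return (lhs / (n * s2) - 1) / (k - 1)


# Invented trios with an evident resemblance within each class.
trios = [[4.1, 5.0, 4.6], [6.2, 6.8, 7.1], [5.0, 5.5, 4.7], [7.9, 7.2, 8.1]]

# Classes whose means are all equal: the left hand of the equation vanishes,
# so r falls to its lower limit -1/(k - 1), here -1/2.
balanced = [[-1, 0, 1], [0, 1, -1], [1, -1, 0], [1, 0, -1]]
```

For the balanced classes both functions return -1/2, the lowest value possible with k = 3.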

39.  Sampling  Errors  of  Intraclass  Correlations 

The  case  k  =  2,  which  is  closely  analogous  to  an 
interclass correlation, may be treated by the transformation
previously employed, namely

$$z = \tfrac{1}{2}\{\log (1 + r) - \log (1 - r)\} ;$$

z is then distributed very nearly in a normal distribution,
the distribution is wholly independent of the
value of the correlation ρ in the population from which
the sample is drawn, and the variance of z consequently
depends only on the size of the sample, being
given by the formula

$$\sigma_z^2 = \frac{1}{n' - 3/2}.$$
The  transformation  has,  therefore,  the  same  ad- 
vantages in  this  case  as  for  interclass  correlations. 
It  will  be  observed  that  the  slightly  greater  accuracy 
of  the  intraclass  correlation,  compared  to  an  interclass 
correlation  based  on  the  same  number  of  pairs,  is 
indicated  by  the  use  of  n'  -  3/2  in  place  of  n'  -  3.  The 
advantage is, therefore, equivalent to 1½ additional
pairs of observations. A second difference lies in the
bias to which such estimates are subject. For interclass
correlations the value found in samples, whether
positive or negative, is exaggerated to the extent of
requiring a correction,

$$\frac{\rho}{2(n' - 1)},$$

to be applied to the average value of z. With intraclass
correlations the bias is always in the negative
direction, and is independent of ρ ; the correction
necessary in these cases being +½ log {n'/(n' - 1)}, or,
approximately, +1/(2n' - 1). This bias is characteristic
of intraclass correlations for all values of k, and
arises from the fact that the symmetrical table does
not provide us with quite the best estimate of the
correlation.

The effect of the transformation upon the error
curves may be seen by comparing Figs. 9 and 10.
Fig. 9 shows the actual error curves of r derived from a
symmetrical table formed from 8 pairs of observations,
drawn from populations having correlation 0 and .8.
Fig. 10 shows the corresponding error curves for the
distribution of z. The three chief advantages noted
in Figs. 7 and 8 are equally visible in the comparison
of Figs. 9 and 10. Curves of very unequal variance
are replaced by curves of equal variance, skew curves
by approximately normal curves, curves of dissimilar
form by curves of similar form. In one respect the
effect of the transformation is more perfect for the
intraclass than it is for the interclass correlations, for,
although in both cases the curves are not precisely
normal, with the intraclass correlations they are
entirely constant in variance and form, whereas with
interclass correlations there is a slight variation in both
respects, as the correlation in the population is varied.
Fig. 10 shows clearly the effect of the bias introduced
in estimating the correlation from the symmetrical
table ; the bias, like the other features of these curves,
is absolutely constant in the scale of z.

Ex. 34. Accuracy of an observed intraclass correlation.
— An intraclass correlation .6000 is derived
from  13  pairs  of  observations  :  estimate  the  correlation 
in  the  population  from  which  it  was  drawn,  and  find 
the  limits  within  which  it  probably  lies. 

Placing  the  values  of  r  and  z  in  parallel  columns, 
we  have 

TABLE 36

                        r.          z.
Calculated value     +.6000      +.6930
Correction                       +.0400
Estimate             +.6249      +.7330
Standard error                   ±.2949
Upper limit          +.8675     +1.3228
Lower limit          +.1421      +.1432

The  calculation  is  carried  through  in  the  z  column, 
and  the  corresponding  values  of  r  found  as  required 
from  the  Table  V.B  (p.  177).  The  value  of  r  is 
obtained  from  the  symmetrical  table,  and  the  corre- 
sponding value  of  z  calculated.  These  values  suffer 
from  a  small  negative  bias,  and  this  is  removed  by 
adding  to  z  the  correction  ;  the  unbiassed  estimate  of  z 
is therefore .7330, and the corresponding value of r,
.6249, is an unbiassed estimate, based upon the sample,
of the correlation in the population from which the
sample was drawn. To find the limits within which this
correlation may be expected to lie, the standard error
of z is calculated, and twice this value is added to and
subtracted from the estimated value to obtain the
values of z at the upper and lower limits. From
these we obtain the corresponding values of r. The
observed correlation must in this case be judged
significant, since the lower limit is positive ; we shall
seldom be wrong in concluding that it exceeds .14
and is less than .87.
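The arithmetic of Ex. 34 can be followed in a few lines. This is a sketch only: it uses the hyperbolic tangent and its inverse in place of Table V.B, and takes the correction in its approximate form 1/(2n' - 1); the variable names are assumptions.

```python
import math

r_obs, n_pairs = 0.6000, 13

z_obs = math.atanh(r_obs)                 # z = 1/2 {log(1 + r) - log(1 - r)}
correction = 1.0 / (2 * n_pairs - 1)      # approximate correction for negative bias
z_est = z_obs + correction                # unbiassed estimate on the z scale
se_z = 1.0 / math.sqrt(n_pairs - 1.5)     # variance of z is 1/(n' - 3/2)

z_upper = z_est + 2 * se_z
z_lower = z_est - 2 * se_z

# Return to the scale of r (the step performed with Table V.B in the text).
r_est = math.tanh(z_est)
r_upper = math.tanh(z_upper)
r_lower = math.tanh(z_lower)
```

The results agree with Table 36 to about three decimal places; small differences in the fourth place are to be expected from the rounding of the tabular entries.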

The  sampling  errors  of  the  cases  in  which  k 
exceeds  2  may  be  more  satisfactorily  treated  from  the 
standpoint  of  the  analysis  of  variance  ;  but  for  those 
cases  in  which  it  is  preferred  to  think  in  terms  of 
correlation, it is possible to give an analogous transformation
suitable for all values of k. Let

$$z = \tfrac{1}{2} \log \frac{1 + (k - 1)r}{1 - r},$$

a  transformation,  which  reduces  to  the  form  previously 
used  when  k  =  2.  Then,  in  random  samples  of  sets  of 
k  observations  the  distribution  of  errors  in  z  is  inde- 
pendent of  the  true  value,  and  approaches  normality 
as  n'  is  increased,  though  not  so  rapidly  as  when  k  =  2. 
The  variance  of  z  may  be  taken,  when  n'  is  sufficiently 
large,  to  be  approximately 

$$\frac{k}{2(k - 1)(n' - 2)}.$$

To  find  r  for  a  given  value  of  z  in  this  transforma- 
tion, Table  V.B  may  still  be  utilised,  as  in  the 
following  example. 
Ex. 35. Extended use of Table V.B. — Find the
value of r corresponding to z = +1.0605, when k = 100.

First deduct from the given value of z half the
natural logarithm of (k - 1) ; enter the difference as
"z" in the table and multiply the corresponding
value of "r" by k ; add k - 2 and divide by 2(k - 1).
The numerical work is shown below :

TABLE 37

z . . . . . . . . . . . . . . . . . . .   +1.0605
½ log (k - 1) = ½ log 99  . . . . . . .    2.2975
"z" . . . . . . . . . . . . . . . . . .   -1.2370
"r" . . . . . . . . . . . . . . . . . .    -.8446
k"r" = 100"r" . . . . . . . . . . . . .   -84.46
k - 2 . . . . . . . . . . . . . . . . .    98
2r(k - 1) = 198r  . . . . . . . . . . .    13.54
r . . . . . . . . . . . . . . . . . . .    +.0684

Ex. 36. Significance of intraclass correlation
from large samples. — A correlation +.0684 was found
between the "ovules failing" in the different pods
from the same tree of Cercis Canadensis. 100 pods
were taken from each of 60 trees (Harris's data).
Is this a significant correlation ?

As the last example shows, z = 1.0605 ; the standard
error of z is .0933. The value of z exceeds its standard
error over 11 times, and the correlation is undoubtedly
significant.
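Exs. 35 and 36 may be retraced as follows. The sketch follows the route of the text (shift z by half the logarithm of k - 1, take the tabular "r", then rescale), with the hyperbolic tangent standing in for Table V.B; the function names are assumptions.

```python
import math


def z_from_r(r, k):
    """z = 1/2 log[{1 + (k - 1) r} / (1 - r)] for classes of k."""
    return 0.5 * math.log((1 + (k - 1) * r) / (1 - r))


def r_from_z(z, k):
    """Inverse transformation by the extended use of Table V.B."""
    r_tabular = math.tanh(z - 0.5 * math.log(k - 1))   # the entry called "r"
    return (k * r_tabular + k - 2) / (2 * (k - 1))


def se_z(k, n_classes):
    """Approximate standard error of z: sqrt of k / {2 (k - 1)(n' - 2)}."""
    return math.sqrt(k / (2.0 * (k - 1) * (n_classes - 2)))


# Harris's data as quoted: 100 pods from each of 60 trees, r = +.0684.
z = z_from_r(0.0684, 100)
r_back = r_from_z(1.0605, 100)
ratio = z / se_z(100, 60)
```

Here `ratio` comes out a little above 11, in agreement with the statement that z exceeds its standard error over 11 times.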

When  n'  is  sufficiently  large  we  have  seen  that, 
subject  to  somewhat  severe  limitations,  it  is  possible 
to  assume  that  the  interclass  correlation  is  normally 
distributed in random samples with standard error

$$\frac{1 - \rho^2}{\sqrt{n' - 1}}.$$

The corresponding formula for intraclass correlations,
using k in a class, is

$$\frac{(1 - \rho)\{1 + (k - 1)\rho\}}{\sqrt{\tfrac{1}{2}k(k - 1)n'}}.$$

The utility of this formula is subject to even more
drastic limitations than is that for the interclass correlation,
for n' is more often small in the former case. In
addition, the regions for which the formula is inapplicable,
even when n' is large, are now not in the neighbourhood
of ±1, but in the neighbourhood of +1 and
-1/(k - 1). When k is large the latter approaches zero,

so  that  an  extremely  skew  distribution  for  r  is  found 
not  only  with  high  correlations  but  also  with  very  low 
ones.  It  is  therefore  not  usually  an  accurate  formula 
to  use  in  testing  significance.  This  abnormality  in 
the  neighbourhood  of  zero  is  particularly  to  be  noticed, 
since  it  is  only  in  this  neighbourhood  that  much  is  to 
be  gained  by  taking  high  values  of  k.  Near  zero,  as 
the  above  formula  shows,  the  accuracy  of  an  intraclass 
correlation  is  with  large  samples  equivalent  to  that  of 
½k(k - 1)n' independent pairs of observations ; which
gives to high values of k an enormous advantage in
accuracy. For correlations near .5, however great k
be made, the accuracy is no higher than that obtainable
from 9n'/2 pairs ; while near +1 it tends to be
no more accurate than would be n' pairs.
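The statements about equivalent numbers of pairs near zero and near .5 can be illustrated numerically. The sketch below assumes the large-sample standard error of the intraclass correlation to be (1 - ρ){1 + (k - 1)ρ} divided by the square root of ½k(k - 1)n', and defines the equivalent number of pairs N, for large samples, as that for which (1 - ρ²)/√N gives the same standard error; both the function names and this definition of equivalence are assumptions made for the example.

```python
import math


def se_intraclass(rho, k, n_classes):
    """Assumed large-sample standard error of an intraclass correlation."""
    return (1 - rho) * (1 + (k - 1) * rho) / math.sqrt(0.5 * k * (k - 1) * n_classes)


def equivalent_pairs(rho, k, n_classes):
    """N such that (1 - rho^2)/sqrt(N) equals the intraclass standard error."""
    return ((1 - rho ** 2) / se_intraclass(rho, k, n_classes)) ** 2


n_classes = 60
near_zero = equivalent_pairs(0.0, 100, n_classes)      # 1/2 k (k - 1) n'
near_half = equivalent_pairs(0.5, 10 ** 6, n_classes)  # tends to 9 n'/2 as k grows
```

With 60 classes of 100 the accuracy near zero is that of 297,000 pairs, whereas near .5 no increase of k carries the equivalent number beyond 270.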
40. Intraclass Correlation as an Example of the Analysis of Variance

A  very  great  simplification  is  introduced  into 
questions  involving  intraclass  correlation  when  we 
recognise  that  in  such  cases  the  correlation  merely 
measures  the  relative  importance  of  two  groups  of 
factors  causing  variation.  We  have  seen  that  in  the 
practical  calculation  of  the  intraclass  correlation  we 
merely obtain the two necessary quantities kn's² and
n's²{1 + (k - 1)r}, by equating them to the two
quantities

$$S_1^{kn'}(x - \bar{x})^2, \qquad k\,S_1^{n'}(\bar{x}_p - \bar{x})^2,$$

of which the first is the sum of the squares (kn' in
number) of the deviations of all the observations from
their general mean, and the second is k times the sum
of the squares of the n' deviations of the mean of
each class from the general mean. Now it may
easily be shown that

$$S_1^{kn'}(x - \bar{x})^2 = k\,S_1^{n'}(\bar{x}_p - \bar{x})^2 + S_1^{kn'}(x - \bar{x}_p)^2,$$

in  which  the  last  term  is  the  sum  of  the  squares  of  the 
deviations  of  each  individual  measurement  from  the 
mean  of  the  class  to  which  it  belongs.  The  following 
table  summarises  these  relations  by  showing  the  number 
of  degrees  of  freedom  involved  in  each  case,  and,  in  the 
last  column,  the  interpretation  put  upon  each  expres- 
sion in  the  calculation  of  an  intraclass  correlation  to 
correspond  to  the  value  obtained  from  a  symmetrical 
table. 
TABLE 38

                    Degrees of     Sum of Squares.
                    Freedom.
Within classes      n'(k - 1)      $S_1^{kn'}(x - \bar{x}_p)^2$            n's²(k - 1)(1 - r)
Between classes     n' - 1         $k\,S_1^{n'}(\bar{x}_p - \bar{x})^2$     n's²{1 + (k - 1)r}
Total               n'k - 1        $S_1^{kn'}(x - \bar{x})^2$              n's²k

It  will  now  be  observed  that  z  of  the  preceding 
section  is,  apart  from  a  constant,  half  the  difference  of 
the  logarithms  of  the  two  parts  into  which  the  sum  of 
squares  has  been  analysed.  The  fact  that  the  form 
of  the  distribution  of  z  in  random  samples  is  inde- 
pendent of  the  correlation  of  the  population  sampled, 
is  thus  a  consequence  of  the  fact  that  deviations  of 
the  individual  observations  from  the  means  of  their 
classes  are  independent  of  the  deviations  of  those 
means  from  the  general  mean.  The  data  provide  us 
with  independent  estimates  of  two  variances  ;  if  these 
variances  are  equal  the  correlation  is  zero  ;  if  our 
estimates  do  not  differ  significantly  the  correlation 
is  insignificant.  If,  however,  they  are  significantly 
different,  we  may  if  we  choose  express  the  fact  in  terms 
of  a  correlation. 

The  interpretation  of  such  an  inequality  of  variance 
in  terms  of  a  correlation  may  be  made  clear  as  follows, 
by  a  method  which  also  serves  to  show  that  the  inter- 
pretation made  by  the  use  of  the  symmetrical  table  is 
slightly defective. Let a quantity be made up of two
parts,  each  normally  and  independently  distributed  ; 
let  the  variance  of  the  first  part  be  A,  and  that  of  the 
second  part,  B  ;  then  it  is  easy  to  see  that  the  variance 
of  the  total  quantity  is  A  +  B.  Consider  a  sample  of 
n' values of the first part, and to each of these add a
sample of k values of the second part, taking a fresh
sample of k in each case. We then have n' classes of
values with k in each class. In the infinite population
from which these are drawn the correlation between
pairs of numbers of the same class will be

$$\rho = \frac{A}{A + B}.$$

From such a set of kn' values we may make
estimates  of  the  values  of  A  and  B,  or  in  other  words 
we  may  analyse  the  variance  into  the  portions  contri- 
buted by  the  two  causes  ;  the  intraclass  correlation 
will  be  merely  the  fraction  of  the  total  variance  due 
to  that  cause  which  observations  in  the  same  class 
have  in  common.  The  value  of  B  may  be  estimated 
directly,  for  variation  within  each  class  is  due  to  this 
cause  alone,  consequently 

$$S(x - \bar{x}_p)^2 = n'(k - 1)B.$$

The  mean  of  the  observations  in  any  class  is  made 
up  of  two  parts,  the  first  part  with  variance  A,  and  a 
second  part,  which  is  the  mean  of  k  values  of  the 
second parts of the individual values, and has therefore
a variance B/k ; consequently from the observed
variation of the means of the classes, we have

$$k\,S(\bar{x}_p - \bar{x})^2 = (n' - 1)(kA + B).$$

Table  38  may  therefore  be  rewritten,  writing  in  the 
last  column  s2  for  A  +  B,  and  r  for  the  unbiassed 
estimate  of  the  correlation. 

TABLE 39

                    Degrees of     Sum of Squares.
                    Freedom.
Within classes      n'(k - 1)      $S_1^{kn'}(x - \bar{x}_p)^2$            n'(k - 1)B = n's²(k - 1)(1 - r)
Between classes     n' - 1         $k\,S_1^{n'}(\bar{x}_p - \bar{x})^2$     (n' - 1)(kA + B) = (n' - 1)s²{1 + (k - 1)r}
Total               n'k - 1        $S_1^{kn'}(x - \bar{x})^2$              (n' - 1)kA + (n'k - 1)B = s²{n'k - 1 - (k - 1)r}

Comparing the last column with that of Table 38
it is apparent that the difference arises solely from
putting n' - 1 in place of n' in the second line ;
the ratio between the sums of squares is altered in
the ratio n' : (n' - 1), which precisely eliminates the
negative bias observed in z derived by the previous
method. The error of that method consisted in
assuming that the total variance derived from n' sets
of related individuals could be accurately estimated by
equating the sum of squares of all the individuals from
their mean, to n's²k ; this error is unimportant when n'
is large, as it usually is when k = 2, but with higher
values of k, data may be of great value even when n'
is very small, and in such cases serious discrepancies
arise from the use of the uncorrected values.
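The estimation of A and B from the two sums of squares, and of the unbiassed correlation from them, may be sketched as below. The function name and the invented data in the usage note are assumptions, and equal class sizes are taken for granted.

```python
import math


def variance_components(classes):
    """Analysis of variance for n' classes of k values each.

    Returns the two sums of squares with the estimates of A, B and r
    obtained by equating them to n'(k - 1)B and (n' - 1)(kA + B).
    """
    n, k = len(classes), len(classes[0])
    grand = sum(sum(c) for c in classes) / (n * k)
    means = [sum(c) / k for c in classes]
    ss_within = sum((v - m) ** 2 for c, m in zip(classes, means) for v in c)
    ss_between = k * sum((m - grand) ** 2 for m in means)
    b_hat = ss_within / (n * (k - 1))
    a_hat = (ss_between / (n - 1) - b_hat) / k
    r_hat = a_hat / (a_hat + b_hat)
    return ss_within, ss_between, a_hat, b_hat, r_hat


def z_from_mean_squares(classes):
    """Half the difference of the natural logarithms of the two mean squares."""
    n, k = len(classes), len(classes[0])
    ss_within, ss_between, _, _, _ = variance_components(classes)
    return 0.5 * math.log((ss_between / (n - 1)) / (ss_within / (n * (k - 1))))
```

Applied to four invented trios such as `[[4.1, 5.0, 4.6], [6.2, 6.8, 7.1], [5.0, 5.5, 4.7], [7.9, 7.2, 8.1]]`, the value of z so found coincides with ½ log [{1 + (k - 1)r}/(1 - r)] calculated from the unbiassed r, so that no correction for bias is then required.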

The direct test of the significance of an intraclass
correlation  may  be  applied  to  such  a  table  of  the 
analysis  of  variance  without  actually  calculating  r. 
If  there  is  no  correlation,  then  A  is  not  significantly 
different  from  zero  ;  there  is  no  difference  between  the 
several  classes  which  is  not  accounted  for,  as  a  random 
sampling  effect  of  the  difference  within  each  class.  In 
fact  the  whole  group  of  observations  is  a  homogeneous 
group  with  variance  equal  to  B. 


41.  Test  of  Significance  of  Difference  of  Variance 

The  test  of  significance  of  intraclass  correlations 
is  thus  simply  an  example  of  the  much  wider  class  of 
tests  of  significance  which  arise  in  the  analysis  of 
variance.  These  tests  are  all  reducible  to  the  single 
problem  of  testing  whether  one  estimate  of  variance 
derived from $n_1$ degrees of freedom is significantly
greater than a second such estimate derived from $n_2$
degrees of freedom. This problem is reduced to its
simplest form by calculating z equal to half the
difference of the natural logarithms of the estimates
of the variance, or to the difference of the logarithms
of the corresponding standard deviations. Then if P
is the probability of exceeding this value by chance, it
is possible to calculate the value of z corresponding
to different values of P, $n_1$, and $n_2$.

A  full  table  of  this  kind,  involving  three  variables, 
would  be  very  extensive  ;  we  therefore  give  tables 
for the important regions P = .05 and P = .01, and for
a number of combinations of $n_1$ and $n_2$ sufficient to
indicate the values for other combinations (Table VI.,
p. 212). We shall give various examples of the use
of this table. When both $n_1$ and $n_2$ are large, and
also for moderate values when they are equal or
nearly equal, the distribution of z is sufficiently nearly
normal for effective use to be made of its standard
deviation, which may be written

$$\sqrt{\tfrac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$

This  includes  the  case  of  the  intraclass  correlation, 
when k = 2, for if we have n' pairs of values, the variation
between classes is based on n' - 1 degrees of
freedom, and that within classes is based on n' degrees
of freedom, so that

$$n_1 = n' - 1, \qquad n_2 = n' ;$$

and for moderately large values of n' we may take z
to be normally distributed as above explained. When
k exceeds 2, we have

$$n_1 = n' - 1, \qquad n_2 = (k - 1)n' ;$$

these  may  be  very  unequal,  so  that  unless  n'  be  quite 
large,  the  distribution  of  z  will  be  perceptibly  asym- 
metrical, and  the  standard  deviation  will  not  provide 
a  satisfactory  test  of  significance. 

Ex.  37.  Sex  difference  in  variance  of  stature. — 
From 1164 measurements of males the sum of squares
of  the  deviations  was  found  to  be  8493  ;  while  from 
1456  measurements  of  females  it  was  9747  :  is  there  a 
significant  difference  in  absolute  variability  ? 
TABLE 40

           Degrees of   Sum of     Mean       Log (Mean    1/n.
           Freedom.     Squares.   Square.    Square).
Men        1163         8493       7.303      1.9883       .0008599
Women      1455         9747       6.699      1.9019       .0006873

                                   Difference  .0864   Sum  .001547

The  mean  squares  are  calculated  from  the  sums  of 
squares  by  dividing  by  the  degrees  of  freedom  ;  the 
difference of the logarithms is .0864, so that z is .0432.
The variance of z is half the sum of the last column, so
that the standard deviation of z is .02781. The difference
in variability, though suggestive of a real effect,
cannot  be  judged  significant  on  these  data. 
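The figures of Ex. 37 follow directly from the sums of squares and the degrees of freedom. The sketch below takes the standard deviation of z as the square root of half the sum of the reciprocals of the degrees of freedom, as in the example; the variable names are assumptions.

```python
import math

ss_men, df_men = 8493.0, 1163
ss_women, df_women = 9747.0, 1455

mean_sq_men = ss_men / df_men          # about 7.303
mean_sq_women = ss_women / df_women    # about 6.699

# z is half the difference of the natural logarithms of the mean squares.
z = 0.5 * (math.log(mean_sq_men) - math.log(mean_sq_women))

# Standard deviation of z for large n1 and n2.
sd_z = math.sqrt(0.5 * (1.0 / df_men + 1.0 / df_women))

ratio = z / sd_z   # z in units of its standard deviation
```

The ratio is somewhat above 1.5, short of the value 2 ordinarily required, which is why the difference cannot be judged significant.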

Ex.  38.  Homogeneity  of  small  samples.  —  In  an 
experiment  on  the  accuracy  of  counting  soil  bacteria, 
a  soil  sample  was  divided  into  four  parallel  samples, 
and  from  each  of  these  after  dilution  seven  plates  were 
inoculated.  The  number  of  colonies  on  each  plate  is 
shown  below.  Do  the  results  from  the  four  samples 
agree  within  the  limits  of  random  sampling  ?  In 
other  words,  is  the  whole  set  of  28  values  homogeneous, 
or  is  there  any  perceptible  intraclass  correlation  ? 
TABLE 41

                      Sample.
Plate.        I.       II.      III.     IV.
1             72       74       78       69
2             69       72       74       67
3             63       70       70       66
4             59       69       58       64
5             59       66       58       62
6             53       58       56       58
7             51       52       56       54

Mean          60.86    65.86    64.28    62.86

From  these  values  we  obtain 

TABLE 42

                   Degrees of   Sum of     Mean      S.D.     Log S.D.
                   Freedom.     Squares.   Square.
Within classes     24           1446       60.25     7.762    2.0493
Between classes    3            94.86      31.62     5.623    1.7269

Total              27           1540.9     57.07     7.55
                                              (Difference) = z    -.3224

The  variation  within  classes  is  actually  the  greater, 
so  that  if  any  correlation  is  indicated  it  must  be 
negative.  The  numbers  of  degrees  of  freedom  are 
small  and  unequal,  so  we  shall  use  Table  VI.  This 
is entered with $n_1$ equal to the degrees of freedom
corresponding to the larger variance, in this case 24 ;
also, $n_2$ = 3. The table gives 1.0781 for the 5 per cent
point ; so that the observed difference, .3224, is really
very moderate, and quite insignificant. The whole set
of 28 values appears to be homogeneous with variance
about 57.07.
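The analysis of Ex. 38 may be recomputed from the counts of Table 41. The sketch below is an illustration with assumed variable names; the 5 per cent point 1.0781 is simply the value quoted from Table VI. Recomputation from the counts as printed gives a sum of squares between classes of about 94.96 rather than 94.86, and z about .322; the conclusion is unaffected.

```python
import math

samples = [
    [72, 69, 63, 59, 59, 53, 51],  # sample I
    [74, 72, 70, 69, 66, 58, 52],  # sample II
    [78, 74, 70, 58, 58, 56, 56],  # sample III
    [69, 67, 66, 64, 62, 58, 54],  # sample IV
]
n_classes, k = len(samples), len(samples[0])
grand_mean = sum(sum(s) for s in samples) / (n_classes * k)
class_means = [sum(s) / k for s in samples]

ss_within = sum((v - m) ** 2 for s, m in zip(samples, class_means) for v in s)
ss_between = k * sum((m - grand_mean) ** 2 for m in class_means)
ss_total = sum((v - grand_mean) ** 2 for s in samples for v in s)

df_within, df_between = n_classes * (k - 1), n_classes - 1
ms_within = ss_within / df_within
ms_between = ss_between / df_between

# Half the difference of the natural logarithms, larger mean square first
# (here the mean square within classes, so n1 = 24 and n2 = 3).
z = 0.5 * math.log(max(ms_within, ms_between) / min(ms_within, ms_between))

five_per_cent_point = 1.0781  # quoted in the text for n1 = 24, n2 = 3
significant = z > five_per_cent_point
```

Since z falls far short of the 5 per cent point, `significant` is false and the 28 counts may be treated as a homogeneous set.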

It  should  be  noticed  that  if  only  two  classes  had 
been  present,  the  test  in  the  above  example  would  have 
been equivalent to testing the significance of t, as explained
in Chapter V. In fact the values for $n_1$ = 1 in
the table of z (p. 212) are nothing but the logarithms of
the values, for P = .05 and .01, in the table of t (p. 139).
Similarly the values for $n_2$ = 1 in Table VI. are the
logarithms of the reciprocals of the values, which would
appear in Table IV. under P = .95 and .99. The
present method may be regarded as an extension of
the method of Chapter V., appropriate when we
wish to compare more than two means. Equally it
may be regarded as an extension of the methods of
Chapter IV., for if $n_2$ were infinite z would equal
½ log (χ²/n) of Table III. for P = .05 and .01, and if $n_1$
were infinite it would equal -½ log (χ²/n) for P = .95 and
.99. Tests of goodness of fit, in which the sampling
variance  is  not  calculable  a  priori,  but  may  be  esti- 
mated from  the  data,  may  therefore  be  made  by 
means  of  Table  VI. 
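These relations are easy to check numerically. The sketch below compares a few entries of the 5 per cent table of z with natural logarithms of t and with ½ log (χ²/n). The values of t (12.706, 1.960) and of χ² (3.841, 5.991) are the familiar 5 per cent values entered by hand; they are assumptions of this illustration and are not quoted from this section.

```python
import math

# Entries of Table VI (5 per cent points of z), keyed by (n1, n2).
z_table = {
    (1, 1): 2.5421,
    (1, math.inf): 0.6729,
    (2, math.inf): 0.5486,
}

# Assumed 5 per cent values of t for n = 1 and n = infinity (two-sided).
t_05 = {1: 12.706, math.inf: 1.960}

# Assumed 5 per cent values of chi-square for n = 1 and n = 2.
chi2_05 = {1: 3.841, 2: 5.991}

# For n1 = 1 the tabulated z should equal log t with n2 degrees of freedom.
log_t_checks = [(n2, math.log(t), z_table[(1, n2)]) for n2, t in t_05.items()]

# For n2 infinite the tabulated z should equal (1/2) log (chi2 / n1).
chi2_checks = [
    (n1, 0.5 * math.log(c / n1), z_table[(n1, math.inf)]) for n1, c in chi2_05.items()
]

for n2, computed, tabulated in log_t_checks:
    print(f"n1 = 1, n2 = {n2}: log t = {computed:.4f}, table z = {tabulated:.4f}")
for n1, computed, tabulated in chi2_checks:
    print(f"n1 = {n1}, n2 = inf: half log chi2/n = {computed:.4f}, table z = {tabulated:.4f}")
```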

Ex.  39.  Comparison  of  intraclass  correlations. — 
The  following  correlations  are  given  (Harris's  data) 
for  the  number  of  ovules  in  different  pods  of  the  same 
tree,  100  pods  being  counted  on  each  tree  (Cercis 
Canadensis)  : 

Meramec Highlands    .    .    60 trees    +.3527
Lawrence, Kansas     .    .    22 trees    +.3999

Is the correlation at Lawrence significantly greater
than that in the Meramec Highlands ?

First we find z in each case from the formula

$$z = \tfrac{1}{2}\{\log(1 + 99r) - \log(1 - r)\}$$

(p. 187) ; this gives z = 2.0081 for Meramec and 2.1071
for Lawrence ; since these were obtained by the method
of the symmetrical table we shall insert the small
correction 1/(2n - 1) and obtain 2.0165 for Meramec,
and 2.1304 for Lawrence, as the values which would
have been obtained by the method of the analysis of
variance.
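The transformation and its correction may be sketched as follows. The function names are illustrative, and writing the coefficient 99 as k - 1 for classes of k = 100 pods is an assumption made here for generality; the text itself states the formula only for 99.

```python
import math


def z_from_intraclass_r(r, k):
    """z = 1/2 {log(1 + (k - 1) r) - log(1 - r)} for classes of k members."""
    return 0.5 * (math.log(1 + (k - 1) * r) - math.log(1 - r))


def corrected_z(r, k, n_classes):
    """Add the small correction 1/(2n - 1), n being the number of classes (trees)."""
    return z_from_intraclass_r(r, k) + 1 / (2 * n_classes - 1)


# Harris's data as quoted above: 100 pods counted on each tree.
z_meramec = z_from_intraclass_r(0.3527, 100)
z_lawrence = z_from_intraclass_r(0.3999, 100)
zc_meramec = corrected_z(0.3527, 100, 60)
zc_lawrence = corrected_z(0.3999, 100, 22)

print(f"Meramec:  z = {z_meramec:.4f}, corrected = {zc_meramec:.4f}")
print(f"Lawrence: z = {z_lawrence:.4f}, corrected = {zc_lawrence:.4f}")
print(f"difference of corrected values = {zc_lawrence - zc_meramec:.4f}")
```

Carrying full precision through the correction can move the last printed digit by one unit relative to the rounded figures in the text.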

To ascertain to what errors these determinations
are subject, consider first the case of Lawrence, which
being based on only 22 trees is subject to the larger
errors. We have n₁ = 21, n₂ = 22 × 99 = 2178. These
values are not given in the table, but from the
value for n₁ = 24, n₂ = ∞ it appears that positive
errors exceeding .2085 will occur in rather more
than 5 per cent of samples. This fact alone settles
the question of significance, for the value for
Lawrence only exceeds that obtained for Meramec
by .1139.

In other cases greater precision may be required.
In the table for z the five values 6, 8, 12, 24, ∞ are
chosen for being in harmonic progression, and so
facilitating interpolation, if we use 1/n as the variable.
If we have to interpolate both for n₁ and n₂, we proceed
in three steps. We find first the values of z for n₁ = 12,
n₂ = 2178, and for n₁ = 24, n₂ = 2178, and from these
obtain the required value for n₁ = 21, n₂ = 2178.
To find the value for n₁ = 12, n₂ = 2178, observe that

$$\frac{60}{2178} = .0275,$$

for n₂ = ∞ we have .2804, and for n₂ = 60 a value higher
by .0450, so that .2804 + .0275 × .0450 = .2816 gives
the approximate value for n₂ = 2178.
Similarly for n₁ = 24

$$.2085 + .0275 \times .0569 = .2101.$$

From these two values we must find the value for
n₁ = 21 ; now

$$\frac{24}{21} = 1 + \frac{1}{7},$$

so that we must add to the value for n₁ = 24
one-seventh of its difference from the value for n₁ = 12 ;
this gives

$$.2101 + \frac{.0715}{7} = .2203,$$

which is approximately the positive deviation which
would be exceeded by chance in 5 per cent of random
samples.
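The three steps above amount to linear interpolation with 1/n as the variable. A minimal sketch follows, using the four entries of the 5 per cent table that bracket the required point; the helper name `interpolate_in_reciprocal` is illustrative.

```python
import math


def reciprocal(n):
    """1/n, with the convention that 1/infinity is zero."""
    return 0.0 if n == math.inf else 1.0 / n


def interpolate_in_reciprocal(n, n_low, z_low, n_high, z_high):
    """Linear interpolation in 1/n between tabulated n_low < n < n_high."""
    weight = (reciprocal(n) - reciprocal(n_high)) / (reciprocal(n_low) - reciprocal(n_high))
    return z_high + weight * (z_low - z_high)


# 5 per cent points of z from Table VI, keyed by (n1, n2).
table = {
    (12, 60): 0.3255, (12, math.inf): 0.2804,
    (24, 60): 0.2654, (24, math.inf): 0.2085,
}

n1, n2 = 21, 22 * 99  # Lawrence: 21 and 2178 degrees of freedom

# Steps 1 and 2: interpolate for n2 = 2178 in the columns n1 = 12 and n1 = 24.
z_12 = interpolate_in_reciprocal(n2, 60, table[(12, 60)], math.inf, table[(12, math.inf)])
z_24 = interpolate_in_reciprocal(n2, 60, table[(24, 60)], math.inf, table[(24, math.inf)])

# Step 3: interpolate for n1 = 21 between the two values just found.
z_21 = interpolate_in_reciprocal(n1, 12, z_12, 24, z_24)

print(f"n1 = 12: {z_12:.4f}   n1 = 24: {z_24:.4f}   n1 = 21: {z_21:.4f}")
```

In the third step the weight works out to exactly one-seventh, which is the fraction used in the text.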

Just as we have found the 5 per cent point for
positive deviations, so the 5 per cent point for negative
deviations may be found by interchanging n₁ and n₂ ;
this turns out to be .2978. If we assume that our
observed value does not transgress the 5 per cent point
in either deviation, that is to say that it lies in the
central nine-tenths of its frequency distribution, we
may say that the value of z for Lawrence, Kansas, lies
between 1.9101 and 2.4282 ; these values being found
respectively by subtracting the positive deviation and
adding the negative deviation to the observed value.
The fact that the two deviations are distinctly
unequal, as is generally the case when n₁ and n₂ are
unequal and not both large, shows that such a case
cannot be treated accurately by means of a probable error.

Somewhat more accurate values than the above
may be obtained by improved methods of
interpolation ; the above method will, however, suffice for all
ordinary requirements, except in the corner of the
table where n₁ exceeds 24 and n₂ exceeds 30. For cases
which fall into this region, the following formula gives
the 5 per cent point within one-hundredth of its value.
If h is the harmonic mean of n₁ and n₂, so that

$$\frac{2}{h} = \frac{1}{n_1} + \frac{1}{n_2},$$

then

$$z = \frac{1.6449}{\sqrt{h-1}} - .7843\left(\frac{1}{n_1} - \frac{1}{n_2}\right).$$

Similarly the 1 per cent point is given
approximately by the formula

$$z = \frac{2.3263}{\sqrt{h-1}} - 1.235\left(\frac{1}{n_1} - \frac{1}{n_2}\right).$$

Let us apply this formula to find the 5 per cent
points for the Meramec Highlands, n₁ = 59, n₂ = 5940 ;
the calculation is as follows :

TABLE 43

1/n₁              .01695      √(h - 1)         10.76
1/n₂              .00017      1/√(h - 1)         .09294
2/h               .01712      First term         .15288
1/h               .00856      Second term        .01316
1/n₁ - 1/n₂       .01678      Difference         .1397
h              116.8          Sum                .1660

The 5 per cent point for positive deviations is
therefore .1397, and for negative deviations .1660 ;
with the same standards as before, therefore, we may
say that the value for Meramec lies between 1.8768
and 2.1825 ; the large overlap of this range with that
of Lawrence shows that the correlations found in the
two districts are not significantly different.
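The calculation of Table 43 may be sketched as a small function. The name `approx_z_point` and its default arguments are illustrative; the constants 1.6449 and .7843 are those of the 5 per cent formula given above, and the negative deviation is obtained, as in the text, by interchanging n₁ and n₂.

```python
import math


def approx_z_point(n1, n2, normal_deviate=1.6449, coefficient=0.7843):
    """Approximate percentage point of z for large n1 and n2.

    h is the harmonic mean of n1 and n2; the defaults give the 5 per cent point.
    """
    h = 2.0 / (1.0 / n1 + 1.0 / n2)
    first_term = normal_deviate / math.sqrt(h - 1.0)
    second_term = coefficient * (1.0 / n1 - 1.0 / n2)
    return first_term - second_term


# Meramec Highlands: n1 = 59, n2 = 60 x 99 = 5940.
positive = approx_z_point(59, 5940)
negative = approx_z_point(5940, 59)  # interchange n1 and n2

z_meramec = 2.0165
lower, upper = z_meramec - positive, z_meramec + negative

print(f"positive deviation = {positive:.4f}, negative deviation = {negative:.4f}")
print(f"z for Meramec lies between {lower:.4f} and {upper:.4f}")
```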

42.  Analysis  of  Variance  into  more  than  Two  Portions 

It  is  often  necessary  to  divide  the  total  variance 
into  more  than  two  portions  ;  it  sometimes  happens 
both  in  experimental  and  in  observational  data  that 
the  observations  may  be  grouped  into  classes  in  more 
than  one  way  ;  each  observation  belongs  to  one  class 
of  type  A  and  to  a  different  class  of  type  B.  In  such 
a  case  we  can  find  separately  the  variance  between 
classes  of  type  A  and  between  classes  of  type  B  ; 
the  balance  of  the  total  variance  may  represent 
only  the  variance  within  each  subclass,  or  there 
may  be  in  addition  an  interaction  of  causes,  so  that 
a  change  in  class  of  type  A  does  not  have  the  same 
effect  in  all  B  classes.  If  the  observations  do  not 
occur  singly  in  the  subclasses,  the  variance  within  the 
subclasses  may  be  determined  independently,  and  the 
presence  or  absence  of  interaction  verified.  Some- 
times also,  for  example,  if  the  observations  are  fre- 
quencies, it  is  possible  to  calculate  the  variance  to  be 
expected  in  the  subclasses. 

Ex.  40.  Diurnal  and  annual  variation  of  rain 
frequency. — The  following  frequencies  of  rain  at  differ- 
ent hours  in  different  months  were  observed  at  Rich- 
mond during  10  years  (quoted  from  Shaw,  with  two 
corrections in the totals).

The variance may be analysed as follows :

TABLE 45

              Degrees of    Sum of
              Freedom.      Squares.      Mean Square.

Months            11         6,568.58       597.144
Hours             23         1,539.33        66.928
Remainder        253         3,819.58        15.097

Total            287        11,927.50

The mean of the 288 values given in the table is
24.7, and if the original data had represented independent
sampling chances, we should expect the mean
square  residue  to  be  nearly  as  great  as  this  or  greater, 
if  the  rain  distribution  during  the  day  differs  in  different 
months.  Clearly  the  residual  variance  is  subnormal, 
and  the  reason  for  this  is  obvious  when  we  consider 
that  the  probability  that  it  should  be  raining  in  the 
2nd  hour  is  not  independent  of  whether  it  is  raining 
or  not  in  the  1st  hour  of  the  same  day.  Each  shower 
will  thus  probably  have  been  entered  several  times, 
and  the  values  for  neighbouring  hours  in  the  same 
month  will  be  positively  correlated.  Much  of  the 
random  variation  has  thus  been  included  in  that 
ascribed  to  the  months,  and  probably  accounts  for 
the  very  irregular  sequence  of  the  monthly  totals.  The 
variance  between  the  24  hours  is,  however,  quite 
significantly  greater  than  the  residual  variance,  and 
this  shows  that  the  rainy  hours  have  been  on  the 
whole  similar  in  the  different  months,  so  that  the 
figures clearly indicate the influence of time of day.
From the data it is not possible to estimate the influence
of  time  of  year,  or  to  discuss  whether  the  effect  of  time 
of  day  is  the  same  in  all  months. 
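The figures of Table 45 can be rechecked, and the comparison of hours with the remainder expressed as a value of z, by the sketch below. The comparison value .2913 is the 1 per cent point of Table VI for n₁ = 24, n₂ = ∞, used here as a rough stand-in for n₁ = 23, n₂ = 253; that substitution is an assumption of this illustration.

```python
import math

# Degrees of freedom and sums of squares from Table 45.
analysis = {
    "Months": (11, 6568.58),
    "Hours": (23, 1539.33),
    "Remainder": (253, 3819.58),
}

mean_square = {name: ss / df for name, (df, ss) in analysis.items()}
total_df = sum(df for df, _ in analysis.values())
total_ss = sum(ss for _, ss in analysis.values())

# z for the variance between hours compared with the residual variance.
z_hours = 0.5 * math.log(mean_square["Hours"] / mean_square["Remainder"])

# The mean frequency, 24.7, is the variance expected from independent chances.
mean_frequency = 24.7

for name, ms in mean_square.items():
    print(f"{name:10s} mean square = {ms:.3f}")
print(f"total: d.f. = {total_df}, sum of squares = {total_ss:.2f}")
print(f"z (hours v. remainder) = {z_hours:.4f}")
print(f"residual mean square below the mean frequency: {mean_square['Remainder'] < mean_frequency}")
```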

Ex.  41.  Analysis  of  variation  in  experimental 
field  trials. — The  table  on  the  following  page  gives  the 
yield  in  lb.  per  plant  in  an  experiment  with  potatoes 
(Rothamsted  data).  A  plot  of  land,  the  whole  of 
which  had  received  a  dressing  of  dung,  was  divided 
into  36  patches,  on  which  12  varieties  were  grown, 
each  variety  having  3  patches  scattered  over  the  area. 
Each  patch  was  divided  into  three  lines,  one  of  which 
received,  in  addition  to  dung,  a  basal  dressing  only, 
containing  no  potash,  while  the  other  two  received 
additional  dressings  of  sulphate  and  chloride  of  potash 
respectively. 

From  data  of  this  sort  a  variety  of  information  may 
be  derived.  The  total  yields  of  the  36  patches  give 
us  35  degrees  of  freedom,  of  which  11  represent 
differences  among  the  12  varieties,  and  24  represent 
the  differences  between  different  patches  growing  the 
same  variety.  By  comparing  the  variance  in  these 
two  classes  we  may  test  the  significance  of  the  varietal 
differences  in  yield  for  the  soil  and  climate  of  the 
experiment.  The  72  additional  degrees  of  freedom 
given  by  the  yields  of  the  separate  rows  consist  of 
2  due  to  manurial  treatment,  which  we  can  subdivide 
into  one  representing  the  differences  due  to  a  potash 
dressing  as  against  the  basal  dressing,  and  a  second 
representing  the  manurial  difference  between  the 
sulphate  and  the  chloride  ;  and  70  more  representing 
the differences observed in manurial response in the
different patches. These latter may in turn be divided
into  22  representing  the  difference  in  manurial  response 
of  the  different  varieties,  and  48  representing  the 
differences  in  manurial  response  in  different  patches 
growing  the  same  variety.  To  test  the  significance  of 
the  manurial  effects,  we  may  compare  the  variance  in 
each  of  the  two  manurial  degrees  of  freedom  with  that 
in  the  remaining  70  ;  to  test  the  significance  of  the 
differences  in  varietal  response  to  manure,  we  compare 
the  variance  in  the  22  degrees  of  freedom  with  that  in 
the  48  ;  while  to  test  the  significance  of  the  difference 
in  yield  of  the  same  variety  in  different  patches,  we 
compare  the  24  degrees  of  freedom  representing  the 
differences  in  the  yields  of  different  patches  growing 
the  same  variety  with  the  48  degrees  representing  the 
differences  of  manurial  response  on  different  patches 
growing  the  same  variety. 
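The partition of the degrees of freedom described above can be tabulated from the design alone. This is a minimal sketch; the dictionary labels are illustrative names for the components named in the text.

```python
varieties = 12
patches_per_variety = 3
rows_per_patch = 3  # basal dressing, sulphate of potash, chloride of potash

patches = varieties * patches_per_variety
observations = patches * rows_per_patch

degrees_of_freedom = {
    "varieties": varieties - 1,
    "patches growing the same variety": varieties * (patches_per_variety - 1),
    "manurial treatment": rows_per_patch - 1,
    "manurial response of varieties": (varieties - 1) * (rows_per_patch - 1),
    "manurial response of patches, same variety":
        varieties * (patches_per_variety - 1) * (rows_per_patch - 1),
}

for component, df in degrees_of_freedom.items():
    print(f"{component:45s} {df:4d}")
print(f"{'total':45s} {sum(degrees_of_freedom.values()):4d}")
```

The first two components make up the 35 degrees of freedom among patch totals, and the last three the 72 additional degrees of freedom given by the separate rows.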

For  each  variety  we  shall  require  the  total  yield 
for  the  whole  of  each  patch,  the  total  yield  for  the 
3  patches  and  the  total  yield  for  each  manure  ;  we 
shall  also  need  the  total  yield  for  each  manure  for  the 
aggregate  of  the  12  varieties  ;  these  values  are  given 
on  next  page  (Table  47). 


The sum of the squares of the deviations of all
the 108 values from their mean is 71.699 ; divided,
according to patches, in 36 classes of 3, the value for
the 36 patches is 61.078 ; dividing this again
according to varieties into 12 classes of 3, the value for the
12 varieties is 43.638. We may express the facts so
far as follows :

TABLE 48

                           Degrees of    Sum of
Variance.                  Freedom.      Squares.     Mean Square.    Log (S.D.).

Between varieties              11         43.6384        3.967           .6890
Between patches for
  same variety                 24         17.4401         .727          -.1594
Within patches                 72         10.6204

Total                         107         71.6989

The value of z, found as the difference of the
logarithms in the last column, is .8484, the corresponding
1 per cent value being about .564 ; the effect of variety
is therefore very significant.

Of the variation within the patches the portion
ascribable to the two differences of manurial treatment
may be derived from the totals for the three manurial
treatments. The sum of the squares of the three
deviations, divided by 36, is .3495 ; of this the square
of the difference of the totals for the two potash
dressings, divided by 72, contributes .0584, while the square
of the difference between their mean and the total
for the basal dressing, divided by 54, gives the
remainder, .2913. It is possible, however, that the
whole effect of the dressings may not appear in these
figures,  for  if  the  different  varieties  had  responded  in 
different  ways,  or  to  different  extents,  to  the  dressings, 
the  whole  effect  would  not  appear  in  the  totals.  The 
seventy  remaining  degrees  of  freedom  would  not  be 
homogeneous.  The  36  values,  giving  the  totals  for 
each  manuring  and  for  each  variety,  give  us  35 
degrees  of  freedom,  of  which  11  represent  the  differ- 
ences of  variety,  2  the  differences  of  manuring,  and 
the  remaining  22  show  the  differences  in  manurial 
response  of  the  different  varieties.  The  analysis  of 
this  group  is  shown  below  : 


TABLE 49

                                          Degrees of    Sum of
Variance due to                           Freedom.      Squares.     Mean Square.

Potash dressing                                1           .2913         .2911
Sulphate v. chloride                           1           .0584         .0584
Differential response of varieties            22          2.1911         .0996
Differential response in patches
  with same variety                           48          8.0798         .1683

                                              72         10.6204
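The division of the two manurial degrees of freedom rests on an algebraic identity among the three treatment totals, with the divisors 36, 72 and 54 used above. Since the totals of Table 47 are not reproduced here, the sketch below checks the identity on hypothetical totals; the three numbers are invented purely for illustration and are not the Rothamsted figures.

```python
def manurial_sums_of_squares(basal, sulphate, chloride, rows_per_total=36):
    """Split the sum of squares among three treatment totals into two parts.

    Each total is the sum of `rows_per_total` observations.
    """
    totals = [basal, sulphate, chloride]
    mean_total = sum(totals) / 3.0
    whole = sum((t - mean_total) ** 2 for t in totals) / rows_per_total
    # Difference between the two potash dressings: divisor 2 x 36 = 72.
    sulphate_v_chloride = (sulphate - chloride) ** 2 / (2.0 * rows_per_total)
    # Mean of the potash totals against the basal total: divisor 1.5 x 36 = 54.
    potash_v_basal = ((sulphate + chloride) / 2.0 - basal) ** 2 / (1.5 * rows_per_total)
    return whole, sulphate_v_chloride, potash_v_basal


# Hypothetical totals, for illustration only.
whole, s_v_c, p_v_b = manurial_sums_of_squares(100.0, 103.5, 102.0)
print(f"whole = {whole:.6f}, parts = {s_v_c:.6f} + {p_v_b:.6f} = {s_v_c + p_v_b:.6f}")
```

Whatever totals are supplied, the two parts add up to the whole, which is why the second part may be found as a remainder.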

To test the significance of the variation observed
in the yield of patches bearing the same variety, we
may compare the value .727 found above from 24
degrees of freedom, with .1683 just found from 48
degrees. The value of z, half the difference of the
logarithms, is .7316, while the 1 per cent point is
about .358. The evidence for unequal fertility of the
different patches is therefore unmistakable. As is
always found in careful field trials, local irregularities
in  the  nature  or  depth  of  the  soil  materially  affect  the 
yields.  In  this  case  the  soil  irregularity  was  perhaps 
combined  with  unequal  quality  or  quantity  of  the  dung 
supplied. 

There is no sign of differential response among the
varieties ; indeed, the difference between patches with
different varieties is less than that found for patches
with the same variety. The difference between the
values is not significant ; z = .2623, while the 5 per
cent point is about .33.

Finally,  the  effect  of  the  manurial  dressings  tested 
is  small  ;  the  difference  due  to  potash  is  indeed  greater 
than  the  value  for  the  differential  effects,  which  we  may 
now call random fluctuations, but z is only .3427, and
would require to be about .7 to be significant. With
no  total  response,  it  is  of  course  to  be  expected,  though 
not  as  a  necessary  consequence,  that  the  differential 
effects  should  be  insignificant.  Evidently  the  plants 
with  the  basal  dressing  had  all  the  potash  necessary, 
and  in  addition  no  apparent  effect  on  the  yield  was 
produced  by  the  difference  between  chloride  and 
sulphate  ions. 
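The four comparisons made in this example can be collected in one short script. The mean squares are those of Tables 48 and 49. For the potash comparison the sketch assumes that the 22 and 48 degrees of freedom are pooled into a single estimate of random fluctuation from 70 degrees of freedom; that reading of the text is an assumption, though it gives a value close to the quoted .3427.

```python
import math


def z_value(larger_mean_square, smaller_mean_square):
    """Half the natural logarithm of the ratio of two mean squares."""
    return 0.5 * math.log(larger_mean_square / smaller_mean_square)


# Mean squares from Tables 48 and 49.
varieties = 3.967             # 11 degrees of freedom
patches_same_variety = 0.727  # 24 degrees of freedom
response_varieties = 0.0996   # 22 degrees of freedom
response_patches = 0.1683     # 48 degrees of freedom
potash = 0.2913               # 1 degree of freedom

# Assumed pooling of the 70 degrees of freedom for differential response.
pooled_response = (2.1911 + 8.0798) / 70

z_variety = z_value(varieties, patches_same_variety)
z_patches = z_value(patches_same_variety, response_patches)
z_differential = z_value(response_patches, response_varieties)
z_potash = z_value(potash, pooled_response)

print(f"varieties v. patches:            z = {z_variety:.4f}")
print(f"patches v. response in patches:  z = {z_patches:.4f}")
print(f"response: patches v. varieties:  z = {z_differential:.4f}")
print(f"potash v. pooled response:       z = {z_potash:.4f}")
```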




212 


STATISTICAL  METHODS 


TABLE VI

5 Per Cent Points of the Distribution of z

Values                                  Values of n₁.
of n₂.       1.      2.      3.      4.      5.      6.      8.     12.     24.      ∞.

   1     2.5421  2.6479  2.6870  2.7071  2.7194  2.7276  2.7380  2.7484  2.7588  2.7693
   2     1.4592  1.4722  1.4765  1.4787  1.4800  1.4808  1.4819  1.4830  1.4840  1.4851
   3     1.1577  1.1284  1.1137  1.1051  1.0994  1.0953  1.0899  1.0842  1.0781  1.0716
   4     1.0212   .9690   .9429   .9272   .9168   .9093   .8993   .8885   .8767   .8639
   5      .9441   .8777   .8441   .8236   .8097   .7997   .7862   .7714   .7550   .7368
   6      .8948   .8188   .7798   .7558   .7394   .7274   .7112   .6931   .6729   .6499
   7      .8606   .7777   .7347   .7080   .6896   .6761   .6576   .6369   .6134   .5862
   8      .8355   .7475   .7014   .6725   .6525   .6378   .6175   .5945   .5682   .5371
   9      .8163   .7242   .6757   .6450   .6238   .6080   .5862   .5613   .5324   .4979
  10      .8012   .7058   .6553   .6232   .6009   .5843   .5611   .5346   .5035   .4657
  11      .7889   .6909   .6387   .6055   .5822   .5648   .5406   .5126   .4795   .4387
  12      .7788   .6786   .6250   .5907   .5666   .5487   .5234   .4941   .4592   .4156
  13      .7703   .6682   .6134   .5783   .5535   .5350   .5089   .4785   .4419   .3957
  14      .7630   .6594   .6036   .5677   .5423   .5233   .4964   .4649   .4269   .3782
  15      .7568   .6518   .5950   .5585   .5326   .5131   .4855   .4532   .4138   .3628
  16      .7514   .6451   .5876   .5505   .5241   .5042   .4760   .4428   .4022   .3490
  17      .7466   .6393   .5811   .5434   .5166   .4964   .4676   .4337   .3919   .3366
  18      .7424   .6341   .5753   .5371   .5099   .4894   .4602   .4255   .3827   .3253
  19      .7386   .6295   .5701   .5315   .5040   .4832   .4535   .4182   .3743   .3151
  20      .7352   .6254   .5654   .5265   .4986   .4776   .4474   .4116   .3668   .3057
  21      .7322   .6216   .5612   .5219   .4938   .4725   .4420   .4055   .3599   .2971
  22      .7294   .6182   .5574   .5178   .4894   .4679   .4370   .4001   .3536   .2892
  23      .7269   .6151   .5540   .5140   .4854   .4636   .4325   .3950   .3478   .2818
  24      .7246   .6123   .5508   .5106   .4817   .4598   .4283   .3904   .3425   .2749
  25      .7225   .6097   .5478   .5074   .4783   .4562   .4244   .3862   .3376   .2685
  26      .7205   .6073   .5451   .5045   .4752   .4529   .4209   .3823   .3330   .2625
  27      .7187   .6051   .5427   .5017   .4723   .4499   .4176   .3786   .3287   .2569
  28      .7171   .6030   .5403   .4992   .4696   .4471   .4146   .3752   .3248   .2516
  29      .7155   .6011   .5382   .4969   .4671   .4444   .4117   .3720   .3211   .2466
  30      .7141   .5994   .5362   .4947   .4648   .4420   .4090   .3691   .3176   .2419
  60      .6933   .5738   .5073   .4632   .4311   .4064   .3702   .3255   .2654   .1644
   ∞      .6729   .5486   .4787   .4319   .3974   .3706   .3309   .2804   .2085       0


TABLE VI (Continued)

1 Per Cent Points of the Distribution of z

Values                                  Values of n₁.
of n₂.       1.      2.      3.      4.      5.      6.      8.     12.     24.      ∞.

   1     4.1535  4.2585  4.2974  4.3175  4.3297  4.3379  4.3482  4.3585  4.3689  4.3794
   2     2.2950  2.2976  2.2984  2.2988  2.2991  2.2992  2.2994  2.2997  2.2999  2.3001
   3     1.7649  1.7140  1.6915  1.6786  1.6703  1.6645  1.6569  1.6489  1.6404  1.6314
   4     1.5270  1.4452  1.4075  1.3856  1.3711  1.3609  1.3473  1.3327  1.3170  1.3000
   5     1.3943  1.2929  1.2449  1.2164  1.1974  1.1838  1.1644  1.1457  1.1239  1.0997
   6     1.3103  1.1955  1.1401  1.1068  1.0843  1.0680  1.0460  1.0218   .9948   .9643
   7     1.2526  1.1281  1.0672  1.0300  1.0048   .9864   .9614   .9335   .9020   .8658
   8     1.2106  1.0787  1.0135   .9734   .9459   .9259   .8983   .8673   .8319   .7904
   9     1.1786  1.0411   .9724   .9299   .9006   .8791   .8494   .8157   .7769   .7305
  10     1.1535  1.0114   .9399   .8954   .8646   .8419   .8104   .7744   .7324   .6816
  11     1.1333   .9874   .9136   .8674   .8354   .8116   .7785   .7405   .6958   .6408
  12     1.1166   .9677   .8919   .8443   .8111   .7864   .7520   .7122   .6649   .6061
  13     1.1027   .9511   .8737   .8248   .7907   .7652   .7295   .6882   .6386   .5761
  14     1.0909   .9370   .8581   .8082   .7732   .7471   .7103   .6675   .6159   .5500
  15     1.0807   .9249   .8448   .7939   .7582   .7314   .6937   .6496   .5961   .5269
  16     1.0719   .9144   .8331   .7814   .7450   .7177   .6791   .6339   .5786   .5064
  17     1.0641   .9051   .8229   .7705   .7335   .7057   .6663   .6199   .5630   .4879
  18     1.0572   .8970   .8138   .7607   .7232   .6950   .6549   .6075   .5516   .4712
  19     1.0511   .8897   .8057   .7521   .7140   .6854   .6447   .5964   .5366   .4560
  20     1.0457   .8831   .7985   .7443   .7058   .6768   .6355   .5864   .5253   .4421
  21     1.0408   .8772   .7920   .7372   .6984   .6690   .6272   .5773   .5150   .4294
  22     1.0363   .8719   .7860   .7309   .6916   .6620   .6196   .5691   .5056   .4176
  23     1.0322   .8670   .7806   .7251   .6855   .6555   .6127   .5615   .4969   .4068
  24     1.0285   .8626   .7757   .7197   .6799   .6496   .6064   .5545   .4890   .3967
  25     1.0251   .8585   .7712   .7148   .6747   .6442   .6006   .5481   .4816   .3872
  26     1.0220   .8548   .7670   .7103   .6699   .6392   .5952   .5422   .4748   .3784
  27     1.0191   .8513   .7631   .7062   .6655   .6346   .5902   .5367   .4685   .3701
  28     1.0164   .8481   .7595   .7023   .6614   .6303   .5856   .5316   .4626   .3624
  29     1.0139   .8451   .7562   .6987   .6576   .6263   .5813   .5269   .4570   .3550
  30     1.0116   .8423   .7531   .6954   .6540   .6226   .5773   .5224   .4519   .3481
  60      .9784   .8025   .7086   .6472   .6028   .5687   .5189   .4574   .3746   .2352
   ∞      .9462   .7636   .6651   .5999   .5522   .5152   .4604   .3908   .2913       0

VIII 

FURTHER  APPLICATIONS  OF  THE 
ANALYSIS  OF  VARIANCE 

43.  We  shall  in  this  chapter  give  examples  of  the 
further  applications  of  the  method  of  the  analysis  of 
variance  developed  in  the  last  chapter  in  connexion 
with  the  theory  of  intraclass  correlations.  It  is  im- 
possible in  a  short  space  to  give  examples  of  all  the 
different  applications  which  may  be  made  of  this 
method  ;  we  shall  therefore  limit  ourselves  to  those 
of  the  most  immediate  practical  importance,  paying 
especial  attention  to  those  cases  where  erroneous 
methods  have  been  largely  used,  or  where  no  alterna- 
tive method  of  attack  has  hitherto  been  put  forward. 

44.  Fitness  of  Regression  Formulae 

There  is  no  more  pressing  need  in  connexion  with 
the  examination  of  experimental  results  than  to  test 
whether  a  given  body  of  data  is  or  is  not  in  agree- 
ment with  any  suggested  hypothesis.  The  previous 
chapters  have  largely  been  concerned  with  such 
tests appropriate to hypotheses involving frequency
of occurrence, such as the Mendelian hypothesis of
segregating genes, or the hypothesis of linear
ment in  linkage  groups,  or  the  more  general  hypotheses 
of  the  independence  or  correlation  of  variates.  More 
frequently,  however,  it  is  desired  to  test  hypotheses 
involving,  in  statistical  language,  the  form  of  regres- 
sion lines.  We  may  wish  to  test,  for  example,  if  the 
growth  of  an  animal,  plant  or  population  follows  an 
assigned  law,  if  for  example  it  increases  with  time  in 
arithmetic  or  geometric  progression,  or  according  to 
the  so-called  "  autocatalytic  "  law  of  increase  ;  we  may 
wish  to  test  if  with  increasing  applications  of  manure, 
plant  growth  increases  in  accordance  with  the  laws 
which  have  been  put  forward,  or  whether  in  fact  the 
data  in  hand  are  inconsistent  with  such  a  supposition. 
Such  questions  arise  not  only  in  crucial  tests  of  widely 
recognised  laws,  but  in  every  case  where  a  relation, 
however  empirical,  is  believed  to  be  descriptive  of  the 
data,  and  are  of  value  not  only  in  the  final  stage  of 
establishing  the  laws  of  nature,  but  in  the  early  stages 
of  testing  the  efficiency  of  a  technique.  The  methods 
we  shall  put  forward  for  testing  the  Goodness  of  Fit 
of  regression  lines  are  aimed  not  only  at  simplifying 
the  calculations  by  reducing  them  to  a  standard  form, 
and  so  making  accurate  tests  possible,  but  at  so  dis- 
playing the  whole  process  that  it  may  be  apparent 
exactly  what  questions  can  be  answered  by  such  a 
statistical  examination  of  the  data. 

If  for  each  of  a  number  of  selected  values  of  the 
independent  variate  x  a  number  of  observations  of 
the  dependent  variate  y  is  made,  let  the  number  of 
values of x available be a ; then a is the number
of arrays in our data. Designating any particular
array by means of the suffix p, the number of
observations in any array will be denoted by n_p, and the mean
of their values by ȳ_p ; ȳ being the general mean of
all the values of y. Then whatever be the nature of
the data, the purely algebraic identity

$$S(y - \bar{y})^2 = S\{n_p(\bar{y}_p - \bar{y})^2\} + SS(y - \bar{y}_p)^2$$

expresses  the  fact  that  the  sum  of  the  squares  of  the 
deviations  of  all  the  values  of  y  from  their  general 
mean  may  be  broken  up  into  two  parts,  one  repre- 
senting the  sum  of  the  squares  of  the  deviations  of  the 
means  of  the  arrays  from  the  general  mean,  each 
multiplied  by  the  number  in  the  array,  while  the 
second  is  the  sum  of  the  squares  of  the  deviations  of 
each  observation  from  the  mean  of  the  array  in  which 
it  occurs.  This  resembles  the  analysis  used  for  intra- 
class  correlations,  save  that  now  the  number  of 
observations  may  be  different  in  each  array.  The 
deviations  of  the  observations  from  the  means  of  the 
arrays  are  due  to  causes  of  variation,  including  errors 
of  grouping,  errors  of  observation,  and  so  on,  which 
are  not  dependent  upon  the  value  of  x  ;  the  standard 
deviation  due  to  these  causes  thus  provides  a  basis  for 
comparison  by  which  we  can  test  whether  the  devia- 
tions of  the  means  of  the  arrays  from  the  values 
expected  by  hypothesis  are  or  are  not  significant. 
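Because the identity is purely algebraic, it can be confirmed on any set of numbers, including arrays of unequal size. The following sketch uses three small hypothetical arrays invented for the purpose; the function name `decompose` is illustrative.

```python
def decompose(arrays):
    """Return (total, between arrays, within arrays) sums of squares."""
    all_values = [y for array in arrays for y in array]
    general_mean = sum(all_values) / len(all_values)
    total = sum((y - general_mean) ** 2 for y in all_values)
    between = sum(
        len(array) * (sum(array) / len(array) - general_mean) ** 2 for array in arrays
    )
    within = sum((y - sum(array) / len(array)) ** 2 for array in arrays for y in array)
    return total, between, within


# Hypothetical observations of y in three arrays of unequal size.
hypothetical_arrays = [
    [4.0, 5.5, 6.0],
    [2.0, 3.5],
    [1.0, 0.5, -0.5, 1.5],
]

total, between, within = decompose(hypothetical_arrays)
print(f"total = {total:.6f}")
print(f"between + within = {between:.6f} + {within:.6f} = {between + within:.6f}")
```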

Let Y_p represent the mean value in any array
expected on the hypothesis to be tested, then

$$S\{n_p(\bar{y}_p - Y_p)^2\}$$

will measure the discrepancy between the data and the
hypothesis.  In  comparing  this  with  the  variations 
within  the  arrays,  we  must  of  course  consider  how 
many  degrees  of  freedom  are  available,  in  which  the 
observations  may  differ  from  the  hypothesis.  In  some 
cases,  which  are  relatively  rare,  the  hypothesis  specifies 
the  actual  mean  value  to  be  expected  in  each  array  ; 
in  such  cases  a  degrees  of  freedom  are  available, 
a  being  the  number  of  the  arrays.  More  frequently, 
the  hypothesis  specifies  only  the  form  of  the  regression 
line,  having  one  or  more  parameters  to  be  determined 
from  the  observations,  as  when  we  wish  to  test  if  the 
regression  can  be  represented  by  a  straight  line,  so 
that  our  hypothesis  is  justified  if  any  straight  line  fits 
the  data.  In  such  cases  to  find  the  number  of  degrees 
of  freedom  we  must  deduct  from  a  the  number  of 
parameters  obtained  from  the  data. 

Ex.  42.  Test  of  straightness  of  regression  line. — 
The  following  data  are  taken  from  a  paper  by  A.  H. 
Hersh  on  the  influence  of  temperature  on  the  number 
of  eye  facets  in  Drosophila  melanogaster,  in  various 
homozygous and heterozygous phases of the "bar"
factor. They represent females heterozygous for
"full" and "ultra-bar," the facet number being
measured in factorial units, effectively a logarithmic
scale.  Can  the  influence  of  temperature  on  facet 
number  be  represented  by  a  straight  line,  in  these 
units ?

TABLE 50

[Frequency table of facet number (rows, from +8.07 to -15.93 in factorial units, at intervals of one unit) against temperature (columns, 15°, 17°, 19°, 21°, 23°, 25°, 27°, 29°, 31° C.). Array totals : 90, 54, 83, 100, 86, 122, 137, 98, 53 ; total 823.]

There are 9 arrays representing 9 different
temperatures. Taking a working mean at -1.93
we calculate the total and average excess over the
working mean from each array, and for the
aggregate of all 9. Each average is found by dividing
the total excess by the number in the array ; three
decimal places are sufficient save in the aggregate,
where four are needed. We have

TABLE 51

Array.           15.     17.     19.     21.     23.     25.     27.      29.      31.    Aggregate.

Total excess     583     294     367     225     -43     +37    -369   -463.5   -306.5      +324
Mean excess    6.478   5.444   4.422   2.250   -.500   +.303  -2.693   -4.730   -5.783    +.3937

The sum of the products of these nine pairs of
numbers, less the product of the final pair, gives the
value of

$$S\{n_p(\bar{y}_p - \bar{y})^2\} = 12{,}370,$$

while from the distribution of the aggregate of all the
values of y we have

$$S(y - \bar{y})^2 = 16{,}202,$$

whence is deduced the following table:

TABLE 52

Variance.          Degrees of Freedom.   Sum of Squares.   Mean Square.
Between arrays              8                12,370
Within arrays             814                 3,832            4.708
Total                     822                16,202
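
As a check on the between-array sum of squares, the calculation can be repeated from the totals of Table 51. This is only a sketch: the array sizes below are an assumption, inferred as total excess divided by mean excess (they add up to the 823 observations), and the variable names are illustrative.

```python
# Total excess over the working mean for the arrays 15, 17, ..., 31 (Table 51).
total_excess = [583, 294, 367, 225, -43, 37, -369, -463.5, -306.5]
# Assumed array sizes, inferred as total excess / mean excess; they sum to 823.
counts = [90, 54, 83, 100, 86, 122, 137, 98, 53]

n_obs = sum(counts)
grand_total = sum(total_excess)

# S{n_p (ybar_p - ybar)^2} = sum of T_p^2 / n_p, less T^2 / N.
ss_between = (sum(t * t / n for t, n in zip(total_excess, counts))
              - grand_total ** 2 / n_obs)
ss_total = 16202.0                  # S(y - ybar)^2, as quoted in the text
ss_within = ss_total - ss_between

df_between = len(counts) - 1        # 8
df_within = n_obs - len(counts)     # 814
ms_within = ss_within / df_within

print(round(ss_between), round(ss_within), round(ms_within, 3))
```

Working from the unrounded totals in this way gives the same figure as the sum of products of totals and means described in the text, apart from rounding in the tabulated means.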

The variance within the arrays is thus only about
4.7; the variance between the arrays will be made up
of a part which can be represented by a linear regression,
and of a part which represents the deviations of
the observed means of arrays from a straight line.
To find the part represented by a linear regression,
calculate

$$S(x - \bar{x})^2 = 4742.21$$

and

$$S(x - \bar{x})(y - \bar{y}) = -7535.38,$$

which latter can be obtained by multiplying the above
total excess values by $x - \bar{x}$; then since

$$\frac{(7535.38)^2}{4742.21} = 11{,}974,$$

we may complete the analysis as follows:

TABLE 53

Variance between Arrays due to   Degrees of Freedom.   Sum of Squares.   Mean Square.
Linear regression                         1                11,974
Deviations from regression                7                   396            56.6
Total                                     8                12,370
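
The partition into linear regression and deviations can be sketched numerically. Two assumptions are made here and are not stated in the text: the array sizes are again inferred from Table 51, and x is measured in class intervals of 2 degrees, a choice made because it reproduces the quoted value 4742.21.

```python
temps = [15, 17, 19, 21, 23, 25, 27, 29, 31]
total_excess = [583, 294, 367, 225, -43, 37, -369, -463.5, -306.5]
counts = [90, 54, 83, 100, 86, 122, 137, 98, 53]   # assumed array sizes

# Assumption: x in units of the 2-degree class interval.
x = [(t - 15) / 2 for t in temps]

n_obs = sum(counts)
x_bar = sum(n * xi for n, xi in zip(counts, x)) / n_obs

# S(x - xbar)^2 over all observations.
sxx = sum(n * (xi - x_bar) ** 2 for n, xi in zip(counts, x))
# S(x - xbar)(y - ybar): each array's total excess multiplied by (x - xbar).
sxy = sum(t * (xi - x_bar) for t, xi in zip(total_excess, x))

ss_between = 12370.0                      # from the analysis between arrays
ss_linear = sxy ** 2 / sxx                # part due to linear regression
ss_deviations = ss_between - ss_linear    # found by difference
ms_deviations = ss_deviations / (len(temps) - 2)

print(round(sxx, 2), round(sxy, 2), round(ss_linear), round(ms_deviations, 1))
```

Because the deviations of x from its mean sum to zero over the whole sample, the working mean used for y does not affect the product sum.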

It is useful to check the figure, 396, found by
differences, by calculating the actual value of Y for
the regression formula and evaluating

$$S\{n_p(\bar{y}_p - Y_p)^2\};$$

such a check has the advantage that it shows to which
arrays in particular the bulk of the discrepancy is due,
in this case to the observations at 23° and 25° C.

The deviations from linear regression are evidently
larger than would be expected, if the regression were
really linear, from the variations within the arrays.
For the value of z, we have

TABLE 54

                  Mean Square.   Natural Log.   ½ Log.
                      56.6          4.0360      2.0180
                       4.708        1.5493       .7746
Difference (z)                                  1.2434

while the 1 per cent point is about .488. There can
therefore be no question of the statistical significance
of the deviations from the straight line, although the
latter accounts for the greater part of the variation.
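
The value of z is simply half the difference of the natural logarithms of the two mean squares. A minimal sketch follows; the function name `fisher_z` is illustrative and not taken from the text.

```python
import math

def fisher_z(ms_larger, ms_smaller):
    """Half the difference of the natural logs of two mean squares."""
    return 0.5 * (math.log(ms_larger) - math.log(ms_smaller))

z = fisher_z(56.6, 4.708)

# exp(2z) recovers the ratio of the two mean squares.
variance_ratio = math.exp(2 * z)

print(round(z, 4), round(variance_ratio, 2))
```

The ratio exp(2z) is the quantity now usually tabulated as the variance ratio, so a z of .488 corresponds to a ratio of about 2.65.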

Note that Sheppard's correction is not to be
applied in making this test; a certain proportion both
of the variation within arrays, and of the deviations
from the regression line is ascribable to errors of
grouping, but to deduct from each the average error
due to this cause would be unduly to accentuate their
inequality, and so to render inaccurate the test of
significance.


45. The "Correlation Ratio" $\eta$

We have seen how, from the sum of the squares
of the deviations of all observations from the general
mean, a portion may be separated representing the
differences between different arrays. The ratio which
this bears to the whole is often denoted by the symbol
$\eta^2$, so that

$$\eta^2 = S\{n_p(\bar{y}_p - \bar{y})^2\} \div S(y - \bar{y})^2,$$

and the square root of this ratio, $\eta$, is called the correlation
ratio of y on x. Similarly if Y is the hypothetical
regression function, we may define R, so that

$$R^2 = S\{n_p(Y - \bar{y})^2\} \div S(y - \bar{y})^2,$$

then R will be the correlation coefficient between y and
Y, and if the regression is linear, $R^2 = r^2$, where r
is the correlation coefficient between x and y. From
these relations it is obvious that $\eta$ exceeds R, and thus
that $\eta$ provides an upper limit, such that no regression
function can be found, the correlation of which with
y is higher than $\eta$.
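
For the data of Example 42 these ratios follow directly from the sums of squares already found (Tables 52 and 53). The short sketch below only restates that arithmetic; the variable names are illustrative.

```python
import math

ss_between = 12370.0   # S{n_p (ybar_p - ybar)^2}, Table 52
ss_linear = 11974.0    # part of it due to linear regression, Table 53
ss_total = 16202.0     # S(y - ybar)^2

eta_squared = ss_between / ss_total
r_squared = ss_linear / ss_total     # R^2 = r^2 for a linear regression

eta = math.sqrt(eta_squared)
r_magnitude = math.sqrt(r_squared)   # r itself is negative here, since S(x - xbar)(y - ybar) < 0

print(round(eta_squared, 4), round(r_squared, 4))
```

The difference between the two ratios, 396 / 16,202, is the share of the total sum of squares taken by the deviations of the array means from the straight line.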

As a descriptive statistic the utility of the correlation
ratio is extremely limited. It will be noticed that
the number of degrees of freedom in the numerator of
$\eta^2$ depends on the number of the arrays, so that, for
instance in Example 42, the value of $\eta$ obtained will
depend, not only on the range of temperatures explored,
but on the number of temperatures employed within a
given range.

To test if an observed value of the correlation
ratio is significant is to test if the variation between
arrays is significantly greater than is to be expected
without correlation, from the variation within arrays,
and this can be done from the analysis of variance
(Table 52) by means of the table of z. Attempts have
been made to test the significance of the correlation
ratio by calculating for it a standard error, but such
attempts overlook the fact that, even with indefinitely
large samples, the distribution of $\eta$ for zero correlation
does not tend to normality, unless the number of
arrays also is increased without limit. On the contrary,
with very large samples, when N is the total number
of observations, $N\eta^2$ tends to be distributed as is $\chi^2$
when n, the number of degrees of freedom, is equal to
(a - 1), a being the number of arrays.

46. Blakeman's Criterion

In the same sense that $\eta^2$ measures the difference
between different arrays, so $\eta^2 - R^2$ measures the
aggregate deviation of the means of the arrays from
the hypothetical regression line. The attempt to
obtain a criterion of linearity of regression by comparing
this quantity to its standard error, results in
the test known as Blakeman's criterion. In this
test, also, no account is taken of the number of the
arrays, and in consequence it does not provide even
a first approximation in estimating what values of
$\eta^2 - r^2$ are permissible. As with $\eta^2$ with zero
correlation, so with $\eta^2 - r^2$, the correlation being
linear, if the number of observations is increased
without limit, the distribution does not tend to normality,
but that of $N(\eta^2 - r^2)$ tends to be distributed as
is $\chi^2$ when n = a - 2. Its mean value is then (a - 2),
and to ignore the value of a is to disregard the main
feature of its sampling distribution.

In Example 42 we have seen that with 9 arrays
the departure from linearity was very markedly
significant; it is easy to see that had there been 90
arrays, with the same values of $\eta^2$ and $r^2$, the departure
from linearity would have been even less than the
expectation based on the variation within each array.
Using Blakeman's criterion, however, these two opposite
conditions are indistinguishable.
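
The contrast between 9 and 90 arrays can be made concrete with a small calculation. The sketch assumes, as the argument does, that the total number of observations (823) and the two sums of squares are unchanged, and that only the degrees of freedom differ; the function name is hypothetical.

```python
n_obs = 823
ss_total = 16202.0
eta_squared = 12370.0 / ss_total
r_squared = 11974.0 / ss_total

def mean_square_ratio(n_arrays):
    """Mean square of deviations from the line divided by mean square within arrays."""
    ss_deviations = (eta_squared - r_squared) * ss_total   # 396
    ss_within = (1 - eta_squared) * ss_total               # 3832
    ms_deviations = ss_deviations / (n_arrays - 2)
    ms_within = ss_within / (n_obs - n_arrays)
    return ms_deviations / ms_within

ratio_9 = mean_square_ratio(9)     # deviations far exceed the within-array variance
ratio_90 = mean_square_ratio(90)   # deviations fall below the within-array variance

print(round(ratio_9, 2), round(ratio_90, 2))
```

The same pair of values for the two ratios thus leads to opposite conclusions once the number of arrays is taken into account.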

As in other cases of testing goodness of fit, so in
testing regression lines it is essential that if any
parameters have to be fitted to the observations, this
process of fitting shall have been efficiently carried out.

Some account of efficient methods has been given
in Chapter V. In general, save in the more complicated
cases, of which this book does not treat, the
necessary condition may be fulfilled by the procedure
known as the Method of Least Squares, by which the
measure of deviation

$$S\{n_p(\bar{y}_p - Y_p)^2\}$$

is reduced to a minimum subject to the hypothetical
conditions which govern the form of Y.

47.  Significance  of  the  Multiple  Correlation  Coefficient 

If, as in Section 29 (p. 132), the regression of
a dependent variate y on a number of independent
variates $x_1$, $x_2$, $x_3$ is expressed in the form

$$Y = b_1x_1 + b_2x_2 + b_3x_3,$$

then  the  correlation  between  y  and  Y  is  greater  than 
the  correlation  of  y  with  any  other  linear  function  of 
the  independent  variates,  and  thus  measures,  in  a  sense, 
the  extent  to  which  the  value  of  y  depends  upon,  or  is 
related  to,  the  combined  variation  of  these  variates. 
The value of the correlation so obtained, denoted by R,
may be calculated from the formula

$$R^2 = \{b_1S(x_1y) + b_2S(x_2y) + b_3S(x_3y)\} \div S(y^2).$$

The multiple correlation, R, differs from the correlation
obtained with a single independent variate in that
it is always positive; moreover, it has been recognised
in the case of the multiple correlation that its random
sampling distribution must depend on the number of
independent variates employed. The exact treatment
is in fact strictly parallel to that developed above
for the correlation ratio, with a similar analysis of
variance.

In the section referred to we made use of the fact
that

$$S(y^2) = S(y - Y)^2 + \{b_1S(x_1y) + b_2S(x_2y) + b_3S(x_3y)\};$$

if $n'$ is the number of observations of y, and p the
number of independent variates, these three terms will
represent respectively $n' - 1$, $n' - p - 1$, and p degrees
of freedom. Consequently the analysis of variance
takes the form:

TABLE 55

Variance due to                            Degrees of Freedom.   Sum of Squares.
Regression function                            $p$               $b_1S(x_1y) + \ldots$
Deviations from the regression function        $n' - p - 1$      $S(y - Y)^2$
Total                                          $n' - 1$          $S(y^2)$

it being assumed that y is measured from its mean
value.

If in reality there is no connexion between the
independent variates and the dependent variate y, the
values in the column headed "sum of squares" will
be divided approximately in proportion to the number
of degrees of freedom; whereas if a significant connexion
exists, then the p degrees of freedom in the
regression function will obtain distinctly more than
their share. The test, whether R is or is not significant,
is in fact exactly the test whether the mean square
ascribable to the regression function is or is not
significantly greater than the mean square of deviations
from the regression function, and may be carried
out, as in all such cases, by means of the table of z.

Ex. 43. Significance of a multiple correlation.
To illustrate the process we may perform the test
whether the rainfall data of Example 24 was significantly
related to the longitude, latitude, and altitude
of the recording stations. From the values found in
that example, the following table may be immediately
constructed.

TABLE 56

Variance due to       Degrees of Freedom.   Sum of Squares.   Mean Square.   ½ Log.
Regression formula            3                 791.7            263.9       2.7878
Deviations                   53                 994.9             18.77      1.4661
Total                        56                1786.6
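
The arithmetic of this table, and of the quantities derived from it in the text that follows, can be retraced in a few lines. This is a sketch only: the deviations sum of squares is obtained by subtraction, 1786.6 - 791.7 = 994.9, which is the value consistent with the tabulated mean square 18.77, and the variable names are illustrative.

```python
import math

ss_regression, df_regression = 791.7, 3
ss_total, df_total = 1786.6, 56

# The deviations line is found by difference from the total.
ss_deviations = ss_total - ss_regression
df_deviations = df_total - df_regression

ms_regression = ss_regression / df_regression
ms_deviations = ss_deviations / df_deviations

# z is half the natural log of the ratio of the two mean squares.
z = 0.5 * math.log(ms_regression / ms_deviations)

# The multiple correlation follows from the same sums of squares.
r_squared = ss_regression / ss_total
r = math.sqrt(r_squared)

print(round(ms_deviations, 2), round(z, 4), round(r_squared, 4), round(r, 4))
```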

The value of z is thus 1.3217 while the 1 per cent
point is about .714, showing that the multiple correlation
is clearly significant. The actual value of the
multiple correlation may easily be calculated from the
above table, for

$$R^2 = 791.7 \div 1786.6 = .4431,$$
$$R = .6657;$$

but this step is not necessary in testing the significance.

48. Technique of Plot Experimentation

The statistical procedure of the analysis of variance
is essential to an understanding of the principles underlying
modern methods of arranging field experiments.
The first requirement which governs all well-planned
experiments is that the experiment should yield not
only a comparison of different manures, treatments,
varieties, etc., but also a means of testing the significance
of such differences as are observed. Consequently
all treatments must at least be duplicated, and
preferably further replicated, in order that a comparison
of replicates may be used as a standard with which to
compare the observed differences. This is a requirement
common to most types of experimentation; the
peculiarity of agricultural field experiments lies in the
fact, verified in all careful uniformity trials, that the
area of ground chosen for the experimental plots may
be assumed to be markedly heterogeneous, in that its
fertility varies in a systematic, and often a complicated
manner from point to point. For our test of significance
to be valid the difference in fertility between plots
chosen as parallels must be truly representative of the
differences between plots with different treatment;
and we cannot assume that this is the case if our plots
have been chosen in any way according to a prearranged
system; for the systematic arrangement of
our plots may have, and tests with the results of uniformity
trials show that it often does have, features in
common with the systematic variation of fertility, and
thus the test of significance is wholly vitiated.

Ex. 44. Accuracy attained by random arrangement.
The direct way of overcoming this difficulty
is to arrange the plots wholly at random. For
example, if 20 strips of land were to be used to
test 5 different treatments each in quadruplicate, we
might take such an arrangement as the following,
found by shuffling 20 cards thoroughly and setting
them out in order:

TABLE 57

   B     C     A     C     E     E     E     A     D     A
3504  3430  3376  3334  3253  3314  3287  3361  3404  3366

   B     C     B     D     D     B     A     D     C     E
3416  3291  3244  3210  3168  3195  3330  3118  3029  3085

The  letters  represent  5  different  treatments  ;  beneath 
each  is  shown  the  weight  of  mangold  roots  obtained  by 
Mercer  and  Hall  in  a  uniformity  trial  with  20  such 
strips. 

The deviations in the total yield of each treatment
are

   A       B       C       D       E
+290    +216     -59    -243    -204;

in the analysis of variance the sum of squares corresponding
to "treatment" will be the sum of these
squares divided by 4. Since the sum of the squares
of the 20 deviations from the general mean is 289,766,
we have the following analysis:

TABLE 58

Variance due to        Degrees of Freedom.   Sum of Squares.   Mean Square.   Standard Deviation.
Treatment                      4                 58,725           14,681            121.1
Experimental error            15                231,041           15,403            124.1
Total                         19                289,766           15,251            123.5

It will be seen that the standard error of a single plot
estimated from such an arrangement is 124.1, whereas,
in this case, we know its true value to be 123.5; this
is an exceedingly close agreement, and illustrates the
manner in which a purely random arrangement of
plots ensures that the experimental error calculated
shall be an unbiassed estimate of the errors actually
present.
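
The analysis of this example can be retraced from the twenty yields and the shuffled letters. The following is a minimal sketch with illustrative variable names; it assumes the letters and yields are read from left to right along the two rows in which they are printed.

```python
# Treatment letters and yields of the 20 strips, read left to right.
letters = "BCACEEEADA" + "BCBDDBADCE"
yields = [3504, 3430, 3376, 3334, 3253, 3314, 3287, 3361, 3404, 3366,
          3416, 3291, 3244, 3210, 3168, 3195, 3330, 3118, 3029, 3085]

grand_mean = sum(yields) / len(yields)
ss_total = sum((y - grand_mean) ** 2 for y in yields)

# Total yield of each treatment (four strips apiece).
treatment_totals = {}
for letter, y in zip(letters, yields):
    treatment_totals[letter] = treatment_totals.get(letter, 0) + y

replicates = 4
mean_total = sum(yields) / len(treatment_totals)
deviations = {k: t - mean_total for k, t in sorted(treatment_totals.items())}

# Sum of squared deviations of the treatment totals, divided by 4.
ss_treatment = sum(d ** 2 for d in deviations.values()) / replicates
ss_error = ss_total - ss_treatment

df_treatment = len(treatment_totals) - 1           # 4
df_error = len(yields) - 1 - df_treatment          # 15
sd_estimated = (ss_error / df_error) ** 0.5        # from the random arrangement
sd_true = (ss_total / (len(yields) - 1)) ** 0.5    # all 19 degrees of freedom

print(deviations)
print(round(ss_treatment), round(ss_error), round(sd_estimated, 1), round(sd_true, 1))
```

Since no treatments were in fact applied in a uniformity trial, the whole of the variation is error, which is why the figure from all 19 degrees of freedom serves as the true value.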

Ex. 45. Restrictions upon random arrangement.
While adhering to the essential condition that the
errors by which the observed values are affected shall
be a random sample of the errors which contribute to
our estimate of experimental error, it is still possible
to eliminate much of the effect of soil heterogeneity,
and so increase the accuracy of our observations, by
laying restrictions on the order in which the strips are
arranged. As an illustration of a method which is
widely applicable, we may divide the 20 strips into
4 blocks, and impose the condition that each treatment
shall occur once in each block; we shall then
be able to separate the variance into three parts
representing (i.) local differences between blocks, (ii.)
differences due to treatment, (iii.) experimental errors;
and if the five treatments are arranged at random
within each block, our estimate of experimental error
will be an unbiassed estimate of the actual errors in
the differences due to treatment. As an example of
a random arrangement subject to the above restriction,
the following was obtained:

A E C D B    C B E D A    A D E B C    C E B A D.


Analysing  out,  with  the  same  data  as  before,  the 
contributions  of  local  differences  between  blocks,  and 
of  treatment,  we  find 

TABLE 59

Variance due to        Degrees of Freedom.   Sum of Squares.   Mean Square.   Standard Deviation.
Local differences              3                154,483           51,494
Treatment                      4                 40,859           10,215
Experimental error            12                 94,424            7,869             88.7
Treatment + error             16                135,283            8,455             92.0
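
The entries of this table can be recomputed from the same twenty yields. The sketch assumes that the four blocks are the successive groups of five adjacent strips, taken in the order in which the yields were printed, and that the letters of each block are read left to right; these assumptions reproduce the tabulated sums of squares.

```python
yields = [3504, 3430, 3376, 3334, 3253, 3314, 3287, 3361, 3404, 3366,
          3416, 3291, 3244, 3210, 3168, 3195, 3330, 3118, 3029, 3085]
arrangement = ["AECDB", "CBEDA", "ADEBC", "CEBAD"]   # one string per block

# Assumption: each block is five consecutive strips.
blocks = [yields[i:i + 5] for i in range(0, len(yields), 5)]

grand_mean = sum(yields) / len(yields)
ss_total = sum((y - grand_mean) ** 2 for y in yields)

# Local differences between blocks (3 degrees of freedom).
ss_blocks = sum(len(b) * (sum(b) / len(b) - grand_mean) ** 2 for b in blocks)

# Treatment totals under the restricted random arrangement.
treatment_totals = {}
for block_letters, block in zip(arrangement, blocks):
    for letter, y in zip(block_letters, block):
        treatment_totals[letter] = treatment_totals.get(letter, 0) + y

n_blocks = len(blocks)
ss_treatment = sum(n_blocks * (t / n_blocks - grand_mean) ** 2
                   for t in treatment_totals.values())
ss_error = ss_total - ss_blocks - ss_treatment

sd_estimated = (ss_error / 12) ** 0.5                  # experimental error, 12 d.f.
sd_true = ((ss_treatment + ss_error) / 16) ** 0.5      # treatment + error, 16 d.f.

print(round(ss_blocks), round(ss_treatment), round(ss_error))
print(round(sd_estimated, 1), round(sd_true, 1))
```

As before, the data come from a uniformity trial, so treatment and error together (16 degrees of freedom) give the true error remaining after the blocks are removed.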

The local differences between the blocks are very
significant, so that the accuracy of our comparisons
is much improved, in fact the remaining variance is
reduced almost to 55 per cent of its previous value.
The arrangement arrived at by chance has happened
to be a slightly unfavourable one, the errors in the
treatment values being a little more than usual, while
the estimate of the standard error is 88.7 against a
true value 92.0. Such variation is to be expected,
and indeed upon it is our calculation of significance
based.

It might have been thought preferable to arrange
the experiment in a systematic order, such as

A B C D E    E D C B A    A B C D E    E D C B A,

and, as a matter of fact, owing to the marked fertility
gradient exhibited by the yields in the present example,
such an arrangement would have produced smaller
errors in the totals of the five treatments. With such
an arrangement, however, we have no guarantee that
an estimate of the standard error derived from the
discrepancies between parallel plots is really representative
of the differences produced between the
different treatments, consequently no such estimate of
the standard error can be trusted, and no test of significance
is possible. A more promising way of eliminating
that part of the fertility gradient which is not
included in the differences between blocks, would be
to impose the restriction that each treatment should
be "balanced" in respect to position within the block.
Thus if any treatment occupied in one block the first
strip, in another block the third strip, and in the two
remaining blocks the fourth strip (the ordinal numbers
adding up to 12), its positions in the blocks would be
balanced, and the total yield would be unaffected by
the fertility gradient. Of the many arrangements
possible subject to this restriction one could be chosen,
and one additional degree of freedom eliminated,
representing the variance due to average fertility
gradient within the blocks. In the present data,
where the fertility gradient is large, this would seem to
give a great increase in accuracy, the standard error
so estimated being reduced from 92.0 to 73.4. But
upon examination it appears that such an estimate is
not genuinely representative of the errors by which
the comparisons are affected, and we shall not thus
obtain a reliable test of significance.


49.  The  Latin  Square 

The method of laying restrictions on the distribution
of the plots and eliminating the corresponding
degrees of freedom from the variance is, however,
capable of some extension in suitably planned experiments.
In a block of 25 plots arranged in 5 rows and
5 columns, to be used for testing 5 treatments, we can
arrange that each treatment occurs once in each row,
and also once in each column, while allowing free
scope to chance in the distribution subject to these
restrictions. Then out of the 24 degrees of freedom,
4 will represent treatment; 8 representing soil differences
between different rows or columns, may be
eliminated; and 12 will remain for the estimation of
error. These 12 will provide an unbiassed estimate of
the errors in the comparison of treatments provided
that every pair of plots, not in the same row or column,
belong equally frequently to the same treatment.

Ex. 46. Doubly restricted arrangements. The
following root weights for mangolds were found by
Mercer and Hall in 25 plots; we have distributed
letters representing 5 different treatments in such a
way that each appears once in each row and column.

TABLE 60

                                                          Total of Row.
                   D 376   E 371   C 355   B 356   A 335      1793
                   B 316   D 338   E 336   A 356   C 332      1678
                   C 326   A 326   B 335   D 343   E 330      1660
                   E 317   B 343   A 330   C 327   D 336      1653
                   A 321   C 332   D 317   E 318   B 306      1594
Total of column     1656    1710    1673    1700    1639      8378

Analysing  out  the  contributions  of  rows,  columns, 
and  treatments  we  have 


TABLE 61

Differences between        Degrees of Freedom.   Sum of Squares.   Mean Square.   S.D.
Rows                               4                4240.24
Columns                            4                 701.84
Treatments                         4                 330.24
Remainder                         12                1754.32
Treatments + Remainder            16                2084.56           130.3       11.41
Total                             24                7026.64           292.8       17.11
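
The sums of squares for rows, columns, and treatments can be recomputed from the 25 plot weights. This is a sketch with illustrative names; the square is entered row by row as it is laid out in Table 60.

```python
# Each cell is (treatment letter, root weight), entered row by row.
square = [
    [("D", 376), ("E", 371), ("C", 355), ("B", 356), ("A", 335)],
    [("B", 316), ("D", 338), ("E", 336), ("A", 356), ("C", 332)],
    [("C", 326), ("A", 326), ("B", 335), ("D", 343), ("E", 330)],
    [("E", 317), ("B", 343), ("A", 330), ("C", 327), ("D", 336)],
    [("A", 321), ("C", 332), ("D", 317), ("E", 318), ("B", 306)],
]
k = len(square)

values = [y for row in square for _, y in row]
correction = sum(values) ** 2 / (k * k)          # (grand total)^2 / 25
ss_total = sum(y * y for y in values) - correction

row_totals = [sum(y for _, y in row) for row in square]
col_totals = [sum(square[i][j][1] for i in range(k)) for j in range(k)]
treatment_totals = {}
for row in square:
    for letter, y in row:
        treatment_totals[letter] = treatment_totals.get(letter, 0) + y

def ss_from_totals(totals):
    """Sum of squares for a classification with k plots per class."""
    return sum(t * t for t in totals) / k - correction

ss_rows = ss_from_totals(row_totals)
ss_cols = ss_from_totals(col_totals)
ss_treatments = ss_from_totals(treatment_totals.values())
ss_remainder = ss_total - ss_rows - ss_cols - ss_treatments

# Uniformity trial: treatments and remainder together estimate the error.
ms_after_elimination = (ss_treatments + ss_remainder) / 16
ms_unrestricted = ss_total / 24

print(round(ss_rows, 2), round(ss_cols, 2), round(ss_treatments, 2), round(ss_remainder, 2))
print(round(ms_after_elimination, 1), round(ms_unrestricted, 1))
```

The ratio of the two mean squares, about 0.44, is the reduction to less than half that the elimination of rows and columns achieves.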

By eliminating the soil differences between different
rows and columns, the mean square has been reduced
to less than half, and the value of the experiment as
a means of detecting differences due to treatment is
therefore more than doubled. This method of equalising
the rows and columns may with advantage be
combined with that of equalising the distribution over
different blocks of land, so that very accurate results
may be obtained by using a number of blocks each
arranged in, for example, 5 rows and columns. In
this way the method may be applied even to cases
with only three treatments to be compared. Further,
since the method is suitable whatever may be the
differences in actual fertility of the soil, the same
statistical method of reduction may be used when, for
instance, the plots are 25 strips lying side by side.
Treating each block of five strips in turn as though they
were successive columns in the former arrangement,
we may eliminate, not only the difference between the
blocks, but such differences as those due to a fertility
gradient, which affect the yield according to the order
of the strips in the block. When, therefore, the
number of strips employed is the square of the number
of treatments, each treatment can be not only balanced
but completely equalised in respect to order in the
block, and we may rely upon the (usually) reduced
value of the standard error obtained by eliminating
the corresponding degrees of freedom. Such a double
elimination may be especially fruitful if the blocks of
strips coincide with some physical feature of the field
such as the ploughman's "lands," which often produce
a characteristic periodicity in fertility due to
variations in depth of soil, drainage, and such factors.

To sum up: systematic arrangements of plots in
field trials should be avoided, since with these it is
usually possible to estimate the experimental error in
several different ways, giving widely different results,
each way depending on some one set of assumptions
as to the distribution of natural fertility, which may
or may not be justified. With unrestricted random
arrangement of plots the experimental error, though
accurately estimated, will usually be unnecessarily
large. In a well-planned experiment certain restrictions
may be imposed upon the random arrangement
of the plots in such a way that the experimental error
may still be accurately estimated, while the greater
part of the influence of soil heterogeneity may be
eliminated.

It may be noted that when, by an improved method
of arranging the plots, we can reduce the standard
error to one-half, the value of the experiment is
increased at least fourfold; for only by repeating the
experiment four times in its original form could the
same accuracy have been attained. This argument
really underestimates the preponderance in the scientific
value of the more accurate experiments, for, in
agricultural plot work, the experiment cannot in
practice be repeated upon identical climatic and soil
conditions.


IX 


THE  PRINCIPLES  OF  STATISTICAL 
ESTIMATION 

50. The practical importance of using satisfactory
methods of statistical estimation, and the widespread
use in statistical literature of inefficient statistics, in
the sense explained in Section 3, makes it necessary for
the research worker, in interpreting his own results,
or studying those reported by others, to discriminate
between those conclusions which flow from the nature
of the biological observations themselves, and those
which are due solely to faulty methods of estimation.

As an example which brings out the main principles
of the theory, and which does not involve data
so voluminous that we cannot easily try out a variety
of methods, we shall choose the estimation of linkage
from the progeny of self-fertilised heterozygotes. Thus
for two factors in maize, Starchy v. Sugary and Green v.
White base leaf we may have (W. A. Carver's data)
such observations as the following seedling counts:

TABLE 62

         Starchy.                Sugary.
    Green.     White.       Green.     White.      Total.
     1997        906          904         32        3839

51. The Significance of Evidence for Linkage

It is a useful preliminary before making a
statistical estimate, such as one of the intensity of
linkage, to test if there is anything to justify estimation
at all. We therefore test the possibility that the two
factors are inherited independently. If such were the
case the two factors, each segregating in a 3 : 1 ratio,
would give the four combinations in the ratio 9 : 3 : 3 : 1,
or with expectations, and corresponding contributions
to $\chi^2$,

TABLE 63

                  Starchy    Starchy    Sugary    Sugary
                  Green.     White.     Green.    White.     Total.
Expectation       2159.4      719.8      719.8     239.9
Difference (d)    -162.4     +186.2     +184.2    -207.9
d²/m                12.21      48.17      47.14    180.17    287.69

Since for 3 degrees of freedom the one per cent
point is only 11.34, the observed values are clearly in
contradiction to the expectations. Such a result would,
however, be produced either by linkage or by a
departure from the 3 : 1 ratios; the test may be made
specific by analysing $\chi^2$ into its components as in
Section 22. For this purpose, designating the four
observed frequencies by a, b, c, d, and their total by
n, the deviations from expectation in the ratio of
starchy and sugary will be measured by

$$x = (a + b) - 3(c + d) = +95,$$

that of the other factor by

$$y = (a + c) - 3(b + d) = +87,$$

while to complete the analysis we need

$$z = a - 3b - 3c + 9d = -3145.$$

Then dividing the square of each discrepancy by
its sampling variance, namely 3n for x and y, and
9n for z, we have the components

$x^2 \div 3n$          .784
$y^2 \div 3n$          .657
$z^2 \div 9n$       286.273

Total               287.714

agreeing with the former total as nearly as its limited
accuracy will allow. The conclusion is evident that
neither of the single factor ratios is abnormal, and that
all but an insignificant fraction of the discrepancy is
ascribable to linkage. The principles on which the
deviations x, y, and z are constructed will be made
more clear in Section 55.
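
The partition of $\chi^2$ into its three components can be verified directly from the four observed counts. The sketch below uses unrounded expectations, which is why the total agrees with the sum of the components rather than with the figure obtained from expectations rounded to one decimal place; the variable names are illustrative.

```python
# Observed seedling counts: Starchy Green, Starchy White, Sugary Green, Sugary White.
a, b, c, d = 1997, 906, 904, 32
n = a + b + c + d

# Expectations under independent 3 : 1 segregation of both factors (9 : 3 : 3 : 1).
expected = [9 * n / 16, 3 * n / 16, 3 * n / 16, n / 16]
chi_squared = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))

# Single-degree-of-freedom comparisons.
x = (a + b) - 3 * (c + d)        # starchy : sugary ratio
y = (a + c) - 3 * (b + d)        # green : white ratio
z = a - 3 * b - 3 * c + 9 * d    # linkage

components = {
    "starchy/sugary": x ** 2 / (3 * n),
    "green/white": y ** 2 / (3 * n),
    "linkage": z ** 2 / (9 * n),
}

print(x, y, z)
print({k: round(v, 3) for k, v in components.items()}, round(chi_squared, 3))
```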

52.  The  Specification  of  the  Progeny  Population  for 
Linked  Factors 

When, as in the present case, the results are to be
interpreted in terms of a definite theory, the specification
of the population consists merely in following out
the logical consequences of that theory. The theory
we have to consider is that in both male and female
gametogenesis, while each gamete has an equal chance
of bearing the starchy or the sugary gene, and again
of bearing the gene for green or white base leaf, yet
the parental combinations Starchy White and Sugary
Green are produced more frequently than the recombination
classes Starchy Green and Sugary White.
If the probability of the two latter classes is $p$ in female
gametogenesis and $p'$ in male gametogenesis, the
probability of the four types of ovules and of pollen
will be

TABLE 64

                  Starchy.                                   Sugary.
            Green.              White.                 Green.               White.
Ovules   $\tfrac12 p$     $\tfrac12(1 - p)$      $\tfrac12(1 - p)$      $\tfrac12 p$
Pollen   $\tfrac12 p'$    $\tfrac12(1 - p')$     $\tfrac12(1 - p')$     $\tfrac12 p'$

Assuming as part of the hypothesis that each
grain of pollen will with equal probability fertilise
each ovule, and that the seeds and seedlings produced
will be equally viable, the probability that a seedling
will be the double recessive Sugary White, which can
only happen if both pollen and ovule carry these
characters, will be $\tfrac14 pp'$. The probability of each of
the other three classes of seedlings may be deduced
at once, for the total probability of the two Sugary
classes is $\tfrac14$ irrespective of linkage, which leaves
$\tfrac14(1 - pp')$ for the Sugary Green class. Similarly the
probability of the Starchy White class is $\tfrac14(1 - pp')$,
leaving $\tfrac14(2 + pp')$ for Starchy Green.
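
These four probabilities can be confirmed by enumerating every union of ovule and pollen. The sketch below is illustrative only: the function names and the trial values of p and p' are arbitrary choices, not taken from the text.

```python
from itertools import product

def gamete_probs(p):
    """Gamete probabilities when the two recombinant types
    (Starchy Green, Sugary White) together have probability p."""
    return {
        ("starchy", "green"): p / 2,
        ("starchy", "white"): (1 - p) / 2,
        ("sugary", "green"): (1 - p) / 2,
        ("sugary", "white"): p / 2,
    }

def seedling_probs(p_female, p_male):
    """Probabilities of the four seedling classes from random fertilisation."""
    classes = {}
    ovules = gamete_probs(p_female)
    pollen = gamete_probs(p_male)
    for (o, prob_o), (m, prob_m) in product(ovules.items(), pollen.items()):
        # Starchy and Green are dominant: one copy from either gamete suffices.
        seed = "Starchy" if "starchy" in (o[0], m[0]) else "Sugary"
        leaf = "Green" if "green" in (o[1], m[1]) else "White"
        key = f"{seed} {leaf}"
        classes[key] = classes.get(key, 0.0) + prob_o * prob_m
    return classes

p, p_prime = 0.2, 0.3          # arbitrary trial values
pp = p * p_prime
probs = seedling_probs(p, p_prime)

for name, value in sorted(probs.items()):
    print(name, round(value, 4))
```

Trying other pairs of values with the same product pp' gives identical seedling probabilities, which illustrates that only the product enters.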

Since these probabilities involve only the quantity
$pp'$, it is only of this and not of the separate values of
$p$ and $p'$ that the data can provide an estimate. We
shall therefore illustrate the problem of estimating the
unknown quantity $pp'$, which we may designate by $\theta$.
If $p$ and $p'$ were equal, then $\sqrt{\theta}$ would give the recombination
fraction in both sexes, and if these are unequal
it will always give their geometric mean. The data
before us, however, throw direct light only on the
value of $\theta$. It is to be observed that in the case of
coupling, when both dominant genes are received
from the same grandparent, exactly the same specification
is used, only it is $1 - \sqrt{\theta}$ instead of $\sqrt{\theta}$ which
is to be interpreted as the recombination fraction.

The statistical problem now takes the definite form: The
probabilities of four events are

$$\tfrac14(2+\theta),\quad \tfrac14(1-\theta),\quad \tfrac14(1-\theta),\quad \tfrac14\theta;$$

estimate the value of the parameter $\theta$ from the observed
frequencies $a$, $b$, $c$, $d$.

53.  The  Multiplicity  of  Consistent  Statistics 

Nothing is easier than to invent methods of estimation. It is the
chief purpose of this chapter to explain how satisfactory methods may
be distinguished from unsatisfactory ones. The late development of
this branch of the subject seems to be chiefly due to the lack of
recognition of the number and variety of the plausible statistics
which present themselves. We shall consider five of these.

In our example we may observe that the probability of the first and
fourth class increases, and that of the two other classes diminishes
as $\theta$ is increased. The expression

$$a - b - c + d$$

will therefore afford a convenient basis for an estimate of $\theta$.
To make a consistent estimate on these lines, we substitute the
expected values

$$\frac{n}{4}\,(2+\theta,\ 1-\theta,\ 1-\theta,\ \theta)$$

for $a$, $b$, $c$, and $d$, and finding the result to be $n\theta$, we
define our estimate, $T_1$, by the equation

$$nT_1 = a - b - c + d.$$

Alternatively, we might take the expression for $z$ in Section 51,
which appears there as a measure of linkage for the purpose of
testing its significance; substituting the expected values, as
before, we obtain $n(4\theta-1)$, and may define a new estimate,
$T_2$, by the equation

$$n(4T_2 - 1) = a - 3b - 3c + 9d$$

or

$$4nT_2 = 2a - 2b - 2c + 10d.$$

Obviously any number of similar estimates may be formed by the same
method.
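
The construction of $T_1$ and $T_2$ can be followed numerically. The sketch below is in Python and is offered only as an illustration; the helper names are assumptions of the example. The observed frequencies are those given in the "Observed" column of Table 65. Each linear expression is evaluated on the observations, and the same expression evaluated on the expected values is shown to return $\theta$, which is what makes the estimate consistent.

```python
a, b, c, d = 1997, 906, 904, 32          # observed frequencies (Table 65)
n = a + b + c + d                        # 3839 seedlings

def expected_counts(theta, n):
    """Expected frequencies (n/4) * (2+theta, 1-theta, 1-theta, theta)."""
    return [n * (2 + theta) / 4, n * (1 - theta) / 4,
            n * (1 - theta) / 4, n * theta / 4]

def estimate_T1(counts):
    """n*T1 = a - b - c + d."""
    a, b, c, d = counts
    return (a - b - c + d) / sum(counts)

def estimate_T2(counts):
    """4n*T2 = 2a - 2b - 2c + 10d."""
    a, b, c, d = counts
    return (2 * a - 2 * b - 2 * c + 10 * d) / (4 * sum(counts))

T1 = estimate_T1([a, b, c, d])
T2 = estimate_T2([a, b, c, d])

# Consistency: with the expected values in place of the observations,
# both expressions reproduce theta itself (0.1 is an arbitrary trial value).
trial_theta = 0.1
check_T1 = estimate_T1(expected_counts(trial_theta, n))
check_T2 = estimate_T2(expected_counts(trial_theta, n))
```

The two values obtained, $T_1$ and $T_2$, are those entered under methods (1) and (2) in Table 65.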

Instead of considering the sum of the extreme frequencies $a$ and $d$
we might have considered their product. The ratio of the product $ad$
to the product $bc$ clearly increases with $\theta$; on substitution
we have an equation for a third estimate in the form

$$\frac{\theta(2+\theta)}{(1-\theta)^2} = \frac{ad}{bc},$$

a quadratic equation of which $T_3$ is taken to be the positive
solution.
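
The quadratic for $T_3$ is readily solved. Writing $r = ad/bc$, the equation becomes $(1-r)\theta^2 + 2(1+r)\theta - r = 0$; the rearranged form of its root used in the Python sketch below is our own algebra, chosen because it remains finite when $r = 1$, and the function name is an assumption of the example.

```python
import math

def estimate_T3(a, b, c, d):
    """Root in (0, 1) of theta*(2+theta)/(1-theta)**2 = a*d/(b*c).

    Requires b > 0 and c > 0. With r = ad/(bc) the quadratic is
    (1-r)*theta**2 + 2*(1+r)*theta - r = 0; multiplying the usual root by
    its conjugate gives theta = r / (1 + r + sqrt(1 + 3r)).
    """
    r = (a * d) / (b * c)
    return r / (1 + r + math.sqrt(1 + 3 * r))

a, b, c, d = 1997, 906, 904, 32          # observed frequencies (Table 65)
T3 = estimate_T3(a, b, c, d)
# Substituting T3 back should recover the observed ratio ad/bc.
residual = T3 * (2 + T3) / (1 - T3) ** 2 - (a * d) / (b * c)
```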

As a fourth statistic we shall choose that given by the method of
maximum likelihood. This method consists in multiplying the logarithm
of the number expected in each class by the number observed, summing
for all classes and finding the value of $\theta$ for which the sum is
a maximum. Now,

$$a\log(2+\theta) + b\log(1-\theta) + c\log(1-\theta) + d\log\theta$$

may be seen, by differentiating with respect to $\theta$, to be a
maximum if

$$\frac{a}{2+\theta} + \frac{d}{\theta} = \frac{b+c}{1-\theta},$$

leading to the quadratic equation

$$n\theta^2 - (a - 2b - 2c - d)\theta - 2d = 0,$$

of which the positive solution, $T_4$, satisfies the condition of
maximum likelihood.
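
A short Python sketch of the maximum likelihood solution follows; the function names are assumptions of the example, and the frequencies are those of Table 65. It takes the positive root of the quadratic and confirms that the derivative of the log-likelihood vanishes there.

```python
import math

def estimate_T4(a, b, c, d):
    """Positive root of n*theta**2 - (a - 2b - 2c - d)*theta - 2d = 0."""
    n = a + b + c + d
    m = a - 2 * b - 2 * c - d
    return (m + math.sqrt(m * m + 8 * n * d)) / (2 * n)

def score(theta, a, b, c, d):
    """Derivative of the log-likelihood with respect to theta."""
    return a / (2 + theta) - (b + c) / (1 - theta) + d / theta

a, b, c, d = 1997, 906, 904, 32          # observed frequencies (Table 65)
T4 = estimate_T4(a, b, c, d)
score_at_T4 = score(T4, a, b, c, d)      # should vanish, up to rounding
```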

Finally, for any value adopted for $\theta$, we shall be able to make
a comparison of observed with expected frequencies, and to calculate
the discrepancy, $\chi^2$, between them. In fact $\chi^2$ can be
expressed in the form

$$\chi^2 = \frac{4}{n}\left(\frac{a^2}{2+\theta} + \frac{b^2}{1-\theta} + \frac{c^2}{1-\theta} + \frac{d^2}{\theta}\right) - n,$$

and the value for which this is a minimum will be the positive
solution of the equation of the fourth degree

$$\frac{a^2}{(2+\theta)^2} + \frac{d^2}{\theta^2} = \frac{b^2+c^2}{(1-\theta)^2},$$

a statistic which we shall designate by $T_5$.
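
Since the equation for $T_5$ is of the fourth degree, a numerical solution is the simplest course. The Python sketch below uses bisection, which is our choice for the illustration and not a method prescribed in the text; the function names and the tolerance are likewise assumptions. It relies on the fact that the left-hand side less the right-hand side decreases steadily as $\theta$ goes from 0 to 1, so that there is a single root, provided $d$ and at least one of $b$, $c$ are not zero.

```python
def chi_square(theta, a, b, c, d):
    """Discrepancy between observation and expectation for a given theta."""
    n = a + b + c + d
    return 4 / n * (a * a / (2 + theta) + (b * b + c * c) / (1 - theta)
                    + d * d / theta) - n

def estimate_T5(a, b, c, d, tol=1e-12):
    """Minimum chi-square estimate, found by bisection on (0, 1)."""
    def g(theta):
        # Proportional to minus the derivative of chi-square;
        # positive to the left of the minimum, negative to the right.
        return (a * a / (2 + theta) ** 2 + d * d / theta ** 2
                - (b * b + c * c) / (1 - theta) ** 2)
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a, b, c, d = 1997, 906, 904, 32          # observed frequencies (Table 65)
T5 = estimate_T5(a, b, c, d)
```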


54.  The  Comparison  of  Statistics  by  means  of  the  Test 
of  Goodness  of  Fit 

All the statistics mentioned, except the last, are easily calculated.
The reader should calculate the first four, and verify that the value
of the fifth given below approximately satisfies its equation. For
each of these we may calculate the numbers expected in the four
classes of seedlings, and compare them with those observed. This is
done in Table 65, where also the values of $\chi^2$ derived from this
comparison are given.

TABLE 65

Comparison of Five Statistical Estimates of Linkage

Method.                           1.        2.        3.        4.        5.   Observed.

T                            .057046   .045194   .035645   .035712   .035785
Recombination per cent         23.88     21.26    18.880    18.898    18.917
Numbers expected:
  1st class                  1974.25  1962.875  1953.711  1953.775  1953.845      1997
  2nd class                   905.00   916.375   925.539   925.475   925.405       906
  3rd class                   905.00   916.375   925.539   925.475   925.405       904
  4th class                    54.75    43.375    34.211    34.275    34.345        32
$\chi^2$                       9.717     3.860    2.0158    2.0154    2.0153

In the actual values of the estimates the first three methods differ
considerably, but the last three are closely alike; so closely that
the expectations of methods (3) and (5) differ from those of (4) by
only about one-fifteenth of a seedling in each class. In the
comparisons between the numbers expected and those observed, the most
important discrepancies are in the fourth class, where method (2)
gives a large and method (1) a very large discrepancy. The contrast
between the first three methods in the values of $\chi^2$ is very
striking. For two degrees of freedom (not three, because on fitting a
linkage value one degree should be eliminated) a value above 9.21
should only occur once in a hundred trials. The value given by method
(2) is not in itself significant, but since its value is nearly
double that of methods (3), (4), and (5) we may be sure that the test
of goodness of fit, if correct for the latter, must be highly
erroneous for method (2), as well as for method (1). The general
theorem which this illustrates is that the test of goodness of fit is
only valid when efficient statistics are used in fitting a hypothesis
to the data; in this case, as will be seen in the next section,
methods (3), (4), and (5) are efficient, while methods (1) and (2)
are not.
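
The entries of Table 65 can be recomputed from the five estimates. The Python sketch below is only an illustration, with names of our own choosing; it takes the values of T as printed in the table, forms the expected numbers $\tfrac{n}{4}(2+T,\ 1-T,\ 1-T,\ T)$, and sums the squared deviations each divided by its expectation.

```python
import math

observed = [1997, 906, 904, 32]          # observed frequencies (Table 65)
n = sum(observed)

# Estimates T for methods (1) to (5), as printed in Table 65.
estimates = {1: 0.057046, 2: 0.045194, 3: 0.035645, 4: 0.035712, 5: 0.035785}

def expected(theta):
    """Numbers expected in the four classes for a given theta."""
    return [n * (2 + theta) / 4, n * (1 - theta) / 4,
            n * (1 - theta) / 4, n * theta / 4]

def chi_square(theta):
    """Sum over classes of (observed - expected)**2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected(theta)))

table = {
    method: {
        "recombination_per_cent": 100 * math.sqrt(T),   # repulsion phase
        "expected": expected(T),
        "chi_square": chi_square(T),
    }
    for method, T in estimates.items()
}

for method, row in table.items():
    print(method, round(row["recombination_per_cent"], 3),
          [round(e, 3) for e in row["expected"]], round(row["chi_square"], 4))
```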

55.  The  Sampling  Variance  of  Statistics 

A  more  searching  examination  of  the  merits  of 
various  statistics  may  be  made  by  calculating  the 
sampling  variances  of  each.  Since  the  subject  of 
sampling  variance  is  usually  treated  by  somewhat 
elaborate  mathematical  methods,  it  will  be  as  well  to 
give  a  number  of  simple  formulae  by  which  the  majority 
of  ordinary  cases  may  be  treated. 

First, if $x$ is a linear function of the observed frequencies, such
as

$$k_1a + k_2b + k_3c + k_4d,$$

then, designating the theoretical probability of any class by $p$,
the mean value of $x$ will be

$$nS(pk).$$

The random sampling variance of $x$ is given by the formula

$$\frac{1}{n}V(x) = S(pk^2) - S^2(pk), \qquad \text{(A)}$$

and if the mean value of $x$ is zero, the variance of $x$ becomes
simply

$$nS(pk^2).$$

Further, if a second linear function of the frequencies, $y$, is
specified by coefficients, $k'$, then, the mean value of $x$ or of
$y$ being zero, the mean product of the deviations of $x$ and $y$ is

$$nS(pkk').$$

In view of this theorem the choice of the linear functions used for
analysing $\chi^2$ in Section 51 will no longer appear arbitrary, and
the values taken for their sampling variance will be apparent. For
the values of $p$ are

$$\tfrac{1}{16}(9,\ 3,\ 3,\ 1),$$

and for $x$ the values of $k$ are

$$1,\ 1,\ -3,\ -3,$$

giving

$$S(pk) = 0, \qquad S(pk^2) = 3,$$

so that the variance is $3n$, the value adopted. For $y$ we evidently
have the same values, with the additional fact that the mean value of
$xy$ is zero. For $z$ again

$$S(pk) = 0, \qquad S(pk^2) = 9,$$

while the mean values of $xz$ and $yz$ are each zero. In analysing
$\chi^2$ into its components we always use linear functions of the
frequencies, the mean value of each being zero, and such that all the
mean products shall vanish.
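
These sums are easily checked with exact fractions. In the Python sketch below, which is illustrative only, the coefficients of $x$ and $y$ are those of the expressions $x = a + b - 3c - 3d$ and $y = a - 3b + c - 3d$ given in Section 57, and those of $z$ are taken from $a - 3b - 3c + 9d$; the helper `S` and the dictionary names are assumptions of the example.

```python
from fractions import Fraction

p = [Fraction(w, 16) for w in (9, 3, 3, 1)]      # class probabilities, no linkage
k = {
    "x": (1, 1, -3, -3),
    "y": (1, -3, 1, -3),
    "z": (1, -3, -3, 9),
}

def S(*factors):
    """Sum over the four classes of the product of the given sequences."""
    total = Fraction(0)
    for terms in zip(*factors):
        term = Fraction(1)
        for t in terms:
            term *= t
        total += term
    return total

# Mean of each function divided by n, i.e. S(pk).
means_per_n = {name: S(p, kk) for name, kk in k.items()}
# Formula (A): V/n = S(pk^2) - S(pk)^2.
variances_per_n = {name: S(p, kk, kk) - S(p, kk) ** 2 for name, kk in k.items()}
# Mean products divided by n, i.e. S(pkk').
products_per_n = {
    (u, v): S(p, k[u], k[v]) for u, v in (("x", "y"), ("x", "z"), ("y", "z"))
}
```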

It should be noted that the mean of $xy$ is only zero in the absence
of linkage. When linkage is present the values of $p$ are

$$\tfrac14(2+\theta,\ 1-\theta,\ 1-\theta,\ \theta),$$

giving

$$nS(pkk') = n(4\theta - 1),$$

and the correlation between $x$ and $y$ is

$$\tfrac13(4\theta - 1).$$

A statistic used for estimation will not be a linear function of the
frequencies, for it must tend to a finite value as the sample is
increased indefinitely; it will, however, often be of the form

$$T = \frac{1}{n}(k_1a + k_2b + k_3c + k_4d),$$

as in our example are $T_1$ and $T_2$.

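For such cases a convenient formula is

$$nV(T) = S(pk^2) - \theta^2, \qquad \text{(B)}$$

the statistic being supposed to be consistent. Now for $T_1$, $k$ is
always $+1$ or $-1$, and we have at once

$$V(T_1) = \frac{1-\theta^2}{n},$$

while for $T_2$, with $k = \tfrac12,\ -\tfrac12,\ -\tfrac12,\ 2\tfrac12$,
and $p = \tfrac14(2+\theta,\ 1-\theta,\ 1-\theta,\ \theta)$, it is easy
to find

$$V(T_2) = \frac{1 + 6\theta - 4\theta^2}{4n}.$$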

These two sampling variances are very different; if $\theta$ is small
(close linkage in repulsion), the variance of $T_2$ is only a quarter
of that of $T_1$, and we may say that $T_2$ utilises four times as
much of the available information as does $T_1$. This advantage
diminishes, but persists over the whole range of repulsion linkages,
for at $\theta = \tfrac14$ the variance of $T_2$ is still only
three-fifths of that of $T_1$. The variances become equal at
$\theta = \tfrac12$, at which value the coupling recombination,
$1-\sqrt{\theta}$, is about .29, and for closer linkage than this, in
the coupling phase, $T_1$ is the better statistic.

The standard error to which either estimate, T, is subject is, of
course, found by taking the square root of the variance; it will be
of more practical interest to find the standard error of the
recombination fraction, $\sqrt{\theta}$. For this purpose the above
variances are divided by $4\theta$, before taking the square root.
Putting $\theta = .0357$ in the variances, we then have the two
estimates of the recombination percentage,

23.88 ± 4.268 and 21.26 ± 2.348,

from the first of which we might judge roughly that the recombination
present lay between 15.3 and 32.4, while the second indicates the
much closer limits 16.6 to 26.0.
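
The arithmetic may be set out as a short Python sketch; the function name is an assumption of the example. The division by $4\theta$ arises because the derivative of $\sqrt{\theta}$ with respect to $\theta$ is $1/(2\sqrt{\theta})$, the square of which is $1/(4\theta)$. The rough limits quoted correspond to twice the standard error on either side of the estimate.

```python
import math

n = 3839            # seedlings counted
theta = 0.0357      # value adopted in the text for evaluating the variances

var_T1 = (1 - theta ** 2) / n                        # formula (B), k = +1 or -1
var_T2 = (1 + 6 * theta - 4 * theta ** 2) / (4 * n)  # formula (B), k = 1/2, -1/2, -1/2, 5/2

def se_recombination_per_cent(var_T, theta):
    """Standard error of 100*sqrt(T): divide V(T) by 4*theta, take the root."""
    return 100 * math.sqrt(var_T / (4 * theta))

se1 = se_recombination_per_cent(var_T1, theta)
se2 = se_recombination_per_cent(var_T2, theta)

# Rough limits at two standard errors about the estimates 23.88 and 21.26.
limits_T1 = (23.88 - 2 * se1, 23.88 + 2 * se1)
limits_T2 = (21.26 - 2 * se2, 21.26 + 2 * se2)
```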

For any function of the frequencies, whether the sample number $n$
appears explicitly or not, we can obtain the approximation to the
sampling variance appropriate to the theory of large samples in the
form

$$\frac{1}{n}V(T) = S\left\{p\left(\frac{\partial T}{\partial a}\right)^2\right\} - \left(\frac{\partial T}{\partial n}\right)^2, \qquad \text{(C)}$$

a formula which involves the differential coefficients of the
function in question with respect to each observed frequency, and to
the total, $n$. After differentiation the expectation $pn$ is
substituted for each frequency $a$. If we apply formula (C) to the
function

$$F = \log(ad) - \log(bc) = \log\{T_3(2+T_3)\} - 2\log(1-T_3),$$

the values of $\partial F/\partial a$ are

$$\frac{1}{a},\quad -\frac{1}{b},\quad -\frac{1}{c},\quad \frac{1}{d},$$

while, since $n$ does not appear explicitly, $\partial F/\partial n = 0$.
Hence, substituting $pn$ for $a$, and the known values of $p$ in terms
of $\theta$, we have

$$\frac{n}{4}V(F) = \frac{1}{2+\theta} + \frac{2}{1-\theta} + \frac{1}{\theta} = \frac{2(1+2\theta)}{\theta(1-\theta)(2+\theta)}.$$

To obtain the variance of $T_3$ we must divide this by the square of
$dF/dT_3$, putting $T_3$ equal to $\theta$ after differentiation; but

$$\frac{dF}{dT_3} = \frac{1}{T_3} + \frac{1}{2+T_3} + \frac{2}{1-T_3} = \frac{2(1+2\theta)}{\theta(1-\theta)(2+\theta)};$$

hence

$$V(T_3) = \frac{2\theta(1-\theta)(2+\theta)}{(1+2\theta)n}.$$
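
Formula (C) lends itself to a numerical check. The Python sketch below is an illustration under our own assumptions: the partial derivatives of $T_3$ are approximated by central differences at the expected frequencies, the step size is arbitrary, and the closed form used for $T_3$ is a rearrangement of the root of its quadratic. Since $n$ does not appear explicitly in $T_3$, the second term of (C) is zero.

```python
import math

def estimate_T3(a, b, c, d):
    """Root in (0, 1) of theta*(2+theta)/(1-theta)**2 = a*d/(b*c)."""
    r = (a * d) / (b * c)
    return r / (1 + r + math.sqrt(1 + 3 * r))

def variance_by_formula_c(stat, theta, n, rel_step=1e-6):
    """Large-sample variance from formula (C), for a statistic in which n
    does not appear explicitly (so the term in dT/dn vanishes)."""
    p = [(2 + theta) / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4]
    expected = [n * pi for pi in p]
    total = 0.0
    for i, pi in enumerate(p):
        h = rel_step * expected[i]
        up, down = list(expected), list(expected)
        up[i] += h
        down[i] -= h
        derivative = (stat(*up) - stat(*down)) / (2 * h)   # central difference
        total += pi * derivative ** 2
    return n * total

def variance_closed_form(theta, n):
    """2*theta*(1-theta)*(2+theta) / ((1+2*theta)*n)."""
    return 2 * theta * (1 - theta) * (2 + theta) / ((1 + 2 * theta) * n)

numerical = variance_by_formula_c(estimate_T3, 0.0357, 3839)
algebraic = variance_closed_form(0.0357, 3839)
```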

For the variance of the statistic which satisfies the conditions of
maximum likelihood a very simple and direct general method is
available. The expression obtained by direct differentiation, and
which, equated to zero, gave the equation for $T_4$ in Section 53, was

$$\frac{a}{2+\theta} - \frac{b+c}{1-\theta} + \frac{d}{\theta}.$$

If this is differentiated again with respect to $\theta$, and the
expected values substituted for $a$, $b$, $c$, and $d$, we obtain

$$-\frac{n}{4}\left(\frac{1}{2+\theta} + \frac{2}{1-\theta} + \frac{1}{\theta}\right);$$

and this is simply equated to $-1/V(T_4)$, giving

$$V(T_4) = \frac{2\theta(1-\theta)(2+\theta)}{(1+2\theta)n},$$

the same expression as we have obtained for the sampling variance of
$T_3$. This expression is of great importance for our problem, for it
has been proved that no statistic can have a smaller sampling
variance, in the theory of large samples, than has the solution of
the equation of maximum likelihood. This group of statistics (to
which the minimum $\chi^2$ solution also always belongs), which agree
in their sampling variance with the maximum likelihood solution, are
therefore of particular value, and are designated efficient
statistics, on the ground that for large samples they may be said to
make use of the whole of the relevant information available, whereas
less efficient statistics such as $T_1$ and $T_2$ utilise only a
portion of it.
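
The calculation for the maximum likelihood statistic can be sketched in Python as follows; the function names are assumptions of the example, and $\theta$ is put equal to the value of $T_4$ printed in Table 65. The variance is the reciprocal of the expected second derivative of the log-likelihood with its sign changed, and dividing by $4\theta$ before taking the root gives the standard error of the recombination fraction.

```python
import math

def expected_curvature(theta, n):
    """Minus the expected second derivative of the log-likelihood:
    (n/4) * (1/(2+theta) + 2/(1-theta) + 1/theta)."""
    return n / 4 * (1 / (2 + theta) + 2 / (1 - theta) + 1 / theta)

def variance_T4(theta, n):
    """Large-sample variance of the maximum likelihood estimate."""
    return 1 / expected_curvature(theta, n)

n, theta = 3839, 0.035712                # sample size; T4 from Table 65
closed_form = 2 * theta * (1 - theta) * (2 + theta) / ((1 + 2 * theta) * n)
se_per_cent = 100 * math.sqrt(variance_T4(theta, n) / (4 * theta))
```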

The expression for the minimum variance

$$\frac{2\theta(1-\theta)(2+\theta)}{(1+2\theta)n}$$

represents, therefore, an intrinsic property of the data,
irrespective of the methods of estimation actually used. For large
samples we may interpret its reciprocal

$$\frac{(1+2\theta)n}{2\theta(1-\theta)(2+\theta)}$$

as a numerical measure of the total amount of information, relevant
to the value of $\theta$, which the sample contains; and it is evident
that each seedling observed contributes a definite amount of
information, measured by

$$\frac{1+2\theta}{2\theta(1-\theta)(2+\theta)},$$

relevant to the estimation of the value of $\theta$. This
consideration affords a basis for the exact treatment of sampling
problems even for small samples, for once the amount of information
in the data can be calculated, the amount extracted by any proposed
method of analysis may be evaluated likewise, though this may be
difficult, and a comparison of the two quantities gives an objective
measure of the efficiency of the method proposed in conveying the
relevant information available.

The actual fraction of the information utilised by inefficient
statistics in large samples is obtained by expressing the random
sampling variance of efficient statistics as a fraction of that of
the statistic in question. Thus for $T_1$ and $T_2$ we have the
fractions,

$$E(T_1) = V(T_4) \div V(T_1) = \frac{2\theta(2+\theta)}{(1+2\theta)(1+\theta)},$$

which rises to unity at $\theta = 1$, but is less at all other
values; and

$$E(T_2) = V(T_4) \div V(T_2) = \frac{8\theta(1-\theta)(2+\theta)}{(1+2\theta)(1+6\theta-4\theta^2)},$$

which rises to unity at $\theta = \tfrac14$, falling to zero if
$\theta = 0$, or $\theta = 1$.

Fig. 11 shows the course of these fractions expressed as a
percentage, for all values of the recombination percentage,
$\sqrt{\theta}$ for repulsion, and $1-\sqrt{\theta}$ for coupling. It
will be seen that for our actual value of about 19 per cent in
repulsion, the efficiency of $T_1$ is about 13 per cent, while that
of $T_2$ is about 43 per cent. The use of $T_1$ wastes about
seven-eighths of the information utilised by $T_3$, $T_4$, and $T_5$,
while the use of $T_2$ wastes more than half of it. In other words,
$T_1$ is only as good an estimate as should be obtained from a count
of 503 seedlings, while $T_2$ is as good as should be obtained from
1661 out of the 3839 actually counted.
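
The equivalent numbers of seedlings follow from the ratios of the sampling variances. The Python sketch below is illustrative, with function names of our own choosing; each efficiency is the large-sample variance of the maximum likelihood estimate, $2\theta(1-\theta)(2+\theta)/\{(1+2\theta)n\}$, divided by that of the statistic in question, $(1-\theta^2)/n$ for $T_1$ and $(1+6\theta-4\theta^2)/(4n)$ for $T_2$, and $\theta$ is put equal to .0357.

```python
def efficiency_T1(theta):
    """V(T4) / V(T1) = 2*theta*(2+theta) / ((1+2*theta)*(1+theta))."""
    return 2 * theta * (2 + theta) / ((1 + 2 * theta) * (1 + theta))

def efficiency_T2(theta):
    """V(T4) / V(T2), using V(T2) = (1 + 6*theta - 4*theta**2) / (4n)."""
    return (8 * theta * (1 - theta) * (2 + theta)
            / ((1 + 2 * theta) * (1 + 6 * theta - 4 * theta ** 2)))

n, theta = 3839, 0.0357

# Number of seedlings which, efficiently used, would give the same precision.
equivalent_T1 = efficiency_T1(theta) * n
equivalent_T2 = efficiency_T2(theta) * n
```

An efficiency multiplied by the number counted gives the size of sample from which an efficient statistic would yield an estimate of equal precision.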

The standard error of the efficient estimates of recombination value
is 1.545 per cent, giving probable limits of 15.8 to 22.0 for the
true value. The use of inefficient statistics is therefore liable to
give not merely inferior estimates of the value sought, but estimates
which are distinctly contradicted by the data from which they are
derived. The value 23.88 per cent obtained for $T_1$ differs from the
better estimates by more than three times the standard error of the
latter. It is highly misleading to derive such an estimate from data
which themselves prove it to be erroneous.

Fig. 11. Efficiency of $T_1$ and $T_2$ for all values of $\theta$.
$T_3$, $T_4$, and $T_5$, having 100 per cent efficiency throughout
the range, are represented by the upper line.

The second respect in which the use of inefficient statistics is
liable to be misleading is in the use of the $\chi^2$ test of
goodness of fit. Using $T_1$, we should naturally be led to conclude
that the simple hypothesis of linked factors was in ill accord with
the observations and that the results must be complicated by some
such additional factor as differential viability. Finding only 32
double recessives against an expectation of 55 it would be natural to
draw the conclusion that this genotype suffered from a low viability;
whereas the data rightly interpreted give no significant indication
of this sort. In the second place, whether the discrepancy is
ascribed to differential viability or not, its existence would
provide a very good reason for distrusting the linkage value obtained
from such data; if, on the contrary, satisfactory methods of
estimation are used, the grounds for this distrust are seen to fall
away.


56.  Comparison  of  Efficient  Statistics 

It has been seen that the three efficient statistics tested give
closely similar results. This is in accordance with a general theorem
that the correlation between any two efficient statistics tends to
+1, as the sample is indefinitely increased. The conclusions drawn
from their use will therefore ordinarily be equivalent. It appears
from Fig. 11 that, for special values of $\theta$, $T_1$ and $T_2$
also rank as efficient.

$T_2$ is efficient when $\theta$ is $\tfrac14$, that is, in the
absence of linkage. This accords with the use of $z$ in Section 51
for testing the significance of linkage, for we are then testing the
hypothesis that the factors are unlinked, and the test may be applied
simply by seeing whether or not $z^2$ exceeds (say) $36n$. Any test
based upon an efficient estimate of linkage compared to its standard
error must agree with this. It is by no means uncommon to find
statistics such as $T_2$ which provide excellent tests of
significance, yet which become highly inefficient in estimating the
magnitude of a significant effect. An outstanding example of this is
the use of the third and fourth moments to measure the departure from
normality of a frequency curve. The third and fourth moments provide
excellent tests of the significance of the departures from normality,
but when the distribution is one of the Pearsonian types differing
considerably from the normal, the third and fourth moments are very
inefficient statistics to use in estimating the form of the curve.
This is the more noteworthy as the method of moments is ordinarily
used for this purpose. The fact is that the efficiency of each of
these statistics rises to 100 per cent only for the normal form, just
as that of $T_2$ reaches 100 per cent only for zero linkage; but that
the efficiency depends on the form of the curve, just as that of
$T_2$ depends on the value of $\theta$, and falls rapidly away as we
leave the special region of high efficiency.

The statistic, $T_1$, is fully efficient when $\theta = 1$, that is,
for very high linkage in the coupling phase; and therefore in the
theory of large samples should give an estimate equivalent to $T_3$,
$T_4$, and $T_5$. This extreme case, $\theta = 1$, is interesting in
bringing out a limitation of the theory of large samples, which it is
sometimes important to bear in mind; for the theory is valid only if
none of the numbers counted, $a$, $b$, $c$, and $d$, are very small.
Now for high linkage in coupling the recombination types, $b$ and
$c$, may be very scarce. It is true that for any proportion of
crossing over, however small, it is possible theoretically to take a
sample so big that $b$ and $c$ will be large enough numbers; and in
such cases the theory of large samples is justified. But it is also
true, for a sample of any given size, that linkage may be so high
that seedlings of types $b$ and $c$ will be few; then, it is easy to
see that some of the efficient statistics will fail. If, for example,
either $b$ or $c$ is zero, $T_3$ will necessarily be unity,
indicating complete linkage, whereas two or three seedlings in the
other recombination class will show that crossing over has really
taken place. In the same way $T_5$ also fails, for it makes the
recombination fraction proportional to $\sqrt{b^2+c^2}$, while $T_1$
and $T_4$ make it proportional to $b+c$. In general the equation for
minimising $\chi^2$ is never satisfactory when some of the classes
are thinly occupied, as one might expect from the nature of $\chi^2$;
this method therefore fails whenever the number of classes is
infinite, as it usually is when we are concerned with the
distributions of continuous variates. The two remaining efficient
statistics $T_1$ and $T_4$ give equivalent estimates

$$\frac{b+c}{n}$$

for the recombination fraction, when the linkage is very high. Of
course, as shown by Fig. 11, for any incomplete linkage the
efficiency of $T_1$ is slightly below 100 per cent, so that the exact
value of $T_4$ is slightly preferable. $T_1$, however, does provide a
distinctly better estimate than $T_3$ or $T_5$ if $b$ and $c$ are
small.
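
The failure of $T_3$ and $T_5$ with thinly occupied classes may be seen in a small numerical case. The frequencies used in the Python sketch below are hypothetical, invented for the illustration and not taken from any experiment: 1000 seedlings in the coupling phase with $b = 0$ and $c = 3$. The function names, the bisection used for $T_5$, and the convention of returning unity for $T_3$ when $bc = 0$ (the limit the text describes) are likewise assumptions of the example.

```python
import math

def estimate_T1(a, b, c, d):
    return (a - b - c + d) / (a + b + c + d)

def estimate_T3(a, b, c, d):
    if b * c == 0:
        return 1.0            # ad/(bc) is infinite, which forces theta = 1
    r = (a * d) / (b * c)
    return r / (1 + r + math.sqrt(1 + 3 * r))

def estimate_T4(a, b, c, d):
    n = a + b + c + d
    m = a - 2 * b - 2 * c - d
    return (m + math.sqrt(m * m + 8 * n * d)) / (2 * n)

def estimate_T5(a, b, c, d, tol=1e-13):
    def g(t):                 # decreasing in t; zero at the minimum of chi-square
        return a * a / (2 + t) ** 2 + d * d / t ** 2 - (b * b + c * c) / (1 - t) ** 2
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

counts = (748, 0, 3, 249)     # hypothetical coupling data, n = 1000

# In coupling the recombination fraction is 1 - sqrt(theta).
recombination = {
    name: 1 - math.sqrt(f(*counts))
    for name, f in (("T1", estimate_T1), ("T3", estimate_T3),
                    ("T4", estimate_T4), ("T5", estimate_T5))
}
```

In this invented case $T_1$ and $T_4$ both give very nearly $(b+c)/n = .003$, $T_3$ gives no recombination at all, and $T_5$ gives a value larger than that of $T_4$ by a factor of about $\sqrt{2}$.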


57. The Interpretation of the Discrepancy $\chi^2$

The statistic obtained by the method of maximum likelihood stands in
a peculiar relation to the measure of discrepancy, $\chi^2$, and an
examination of this relation will serve to illuminate the method,
using degrees of freedom, which we have adopted in Chapter IV., and
throughout the book. It has been stated that although in the
distribution of a given number of individuals among four classes
there are three degrees of freedom, yet if, as in the present
problem, the expected numbers have been calculated from those
observed by means of an adjustable parameter ($\theta$), then only 2
degrees of freedom remain in which observation can differ from
hypothesis. Consequently the value of $\chi^2$ calculated in such a
case is to be compared with the values characteristic of its
distribution for 2 degrees of freedom. This principle has been
disputed, but the common-sense considerations upon which it was based
have since received complete theoretical verification. In the present
instance we can in fact identify the two degrees of freedom
concerned. For the observed numbers in each class will be entirely
specified if we know:

(i.) The number in the sample;
(ii.) The ratio of starchy to sugary plants;
(iii.) The ratio of green to white base leaf;
(iv.) The intensity of linkage.

Now if the expected series agrees in items (i.) and (iv.), it can
only differ in items (ii.) and (iii.) and these will be completely
given by the two quantities $x$ and $y$ defined by

$$x = a + b - 3c - 3d,$$
$$y = a - 3b + c - 3d,$$

specifying the ratios by linear functions of the frequencies.

The mean values of $x$ and $y$ are zero, and the random sampling
variance of each is $3n$. In the absence of linkage their deviations
will be independent, but if linkage is present the mean value of $xy$
has been found to be

$$-3n\,\frac{1-4\theta}{3} = n(4\theta - 1),$$

and the correlation between $x$ and $y$ to be

$$\rho = -\frac{1-4\theta}{3}.$$

The simultaneous deviations of $x$ and $y$ from zero will therefore
be measured (compare Section 30) by

$$Q^2 = \frac{1}{1-\rho^2}\left\{\frac{x^2 - 2\rho xy + y^2}{3n}\right\} = \frac{3}{8n(1-\theta)(1+2\theta)}\left\{x^2 + y^2 + \tfrac23(1-4\theta)xy\right\}.$$

This expression, which of course depends upon $\theta$, is a
quadratic function of the frequencies; in this it resembles $\chi^2$,
and on comparing term by term the two expressions it appears that

$$\chi^2 = Q^2 + \frac{1}{I}\left\{\frac{a}{2+\theta} - \frac{b+c}{1-\theta} + \frac{d}{\theta}\right\}^2,$$

where $I$ is the quantity of information contained in the data as
defined in Section 55.

This identity has two important consequences; first that
$\chi^2 = Q^2$ for the particular value of $\theta$ given by the
equation of maximum likelihood, and for no other value. At this
point, then, even for finite samples, the deviations between
observation and expectation represent precisely the deviations in the
two single factor ratios.

The second point is that for any value of $\theta$, $\chi^2$ is the
sum of two positive parts of which one is $Q^2$, while the other
measures the deviation of the value of $\theta$ considered from the
maximum likelihood solution; this latter part is the contribution to
$\chi^2$ of errors of estimation, while the agreement of observation
with hypothesis is measured by $Q^2$ only.
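
The division of $\chi^2$ into these two parts can be verified numerically. In the Python sketch below, given only as an illustration with function names of our own, $Q^2$ is computed from $x = a + b - 3c - 3d$ and $y = a - 3b + c - 3d$ with correlation $\rho = -(1-4\theta)/3$ and variance $3n$, and the second part is the square of $a/(2+\theta) - (b+c)/(1-\theta) + d/\theta$ divided by the quantity of information $(1+2\theta)n/\{2\theta(1-\theta)(2+\theta)\}$. The frequencies are those of Table 65, and the trial values of $\theta$ are arbitrary.

```python
def chi_square(theta, a, b, c, d):
    n = a + b + c + d
    return 4 / n * (a * a / (2 + theta) + (b * b + c * c) / (1 - theta)
                    + d * d / theta) - n

def q_square(theta, a, b, c, d):
    """Joint discrepancy of the two single factor ratios."""
    n = a + b + c + d
    x = a + b - 3 * c - 3 * d
    y = a - 3 * b + c - 3 * d
    rho = -(1 - 4 * theta) / 3
    return (x * x - 2 * rho * x * y + y * y) / (3 * n * (1 - rho * rho))

def estimation_part(theta, a, b, c, d):
    """Contribution to chi-square of the error of estimation."""
    n = a + b + c + d
    score = a / (2 + theta) - (b + c) / (1 - theta) + d / theta
    information = (1 + 2 * theta) * n / (2 * theta * (1 - theta) * (2 + theta))
    return score ** 2 / information

counts = (1997, 906, 904, 32)            # observed frequencies (Table 65)
parts = {
    theta: (chi_square(theta, *counts),
            q_square(theta, *counts),
            estimation_part(theta, *counts))
    for theta in (0.02, 0.035712, 0.057046, 0.3)
}
```

At the maximum likelihood value .035712 the second part is negligible, while at .057046, the value given by method (1), it accounts for the greater part of the $\chi^2$ of Table 65.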

Fig. 12 shows the values of $\chi^2$ and $Q^2$ over the region
covering the three efficient solutions.

The contact of the graphs at the maximum likelihood solution makes it
evident why the solution based on minimum $\chi^2$ should be of no
special interest, although $\chi^2$ is a valid measure of discrepancy
between observation and hypothesis. As the hypothetical value,
$\theta$, is changed the value of $Q^2$ changes, and, although this
change is very minute, it gives the line a sufficient slope to make
an appreciable shift in the point of contact.

If we set aside the portion ascribable to errors of estimation, which
satisfactory methods of estimation will always reduce to a trifling
amount, it is apparent that the measure of discrepancy $\chi^2$, in
our chosen problem, merely measures the deviation from expectation of
the two single factor ratios, and its significance must therefore be
judged by comparison with expectation for two degrees of freedom.
Such a comparison will give an objective test dependent only on the
data, and independent of our methods of treating it, if and only if
the error of estimation measured from the maximum likelihood solution
is sufficiently small. This, of course, where the theory of large
samples is applicable, will be true if any efficient statistic is
used; it will always be true for the method of maximum likelihood.

[Fig. 12. Vertical axis: values of discrepancy.]

58.  Summary  of  Principles 

In any problem of estimation innumerable methods may be invented
arbitrarily, all of which will tend to give the correct results as
the available data are increased indefinitely. Each of these methods
supplies a formula from which a statistic, intended as an estimate of
the unknown, can be calculated from the observed frequencies. These
statistics are of very different value.

A test of five such statistics in a simple genetical problem has
shown that a particular group of them give closely concordant
results, while the estimates obtained by the remainder are
discrepant. This discrepancy is particularly marked in the misleading
values found for $\chi^2$.

An examination of the sampling errors shows that the concordant group
have in large samples a variance equal to that of the maximum
likelihood solution, and therefore as small as possible. These are
efficient statistics; the variances of the inefficient statistics are
larger, and may be so large that their values are quite inconsistent
with the data from which they are derived.

Efficient statistics give closely equivalent results if the samples
are sufficiently large, but when the theory of large samples no
longer holds, such statistics, other than that obtained by the method
of maximum likelihood, may fail.

The measure of discrepancy $\chi^2$ may be divided into two parts,
one measuring the real discrepancy between observation and
hypothesis, while the other measures merely the discrepancy between
the value adopted and that given by the method of maximum likelihood.

The amount of information supplied by the data is capable of exact
measurement, and the fraction of the information available which is
utilised by any inefficient statistic can thereby be calculated. The
same method may, though more laboriously, be applied to compare
efficient statistics when the sample of data is small.

It will be readily understood that the extensive investigation which
we have given to a somewhat trivial genetical problem is not
necessary to its practical treatment. Its purpose has been to
elucidate principles which are applicable to all problems involving
statistical estimation. In practice one need seldom do more than
solve, at least to a good approximation, the equation of maximum
likelihood, and calculate the sampling variance of the estimate so
obtained.